Class 1: Concepts for Predictive Modeling

by Cory Lanker, LLNL statistician

June 26, 2017, 10:00-11:30 in B543 R1001

This course is a four-class introduction to predictive modeling. It is for all skill levels and uses the R statistical software. Because the course is presented as R notebooks, all of the code is included here. The first three classes are lectures covering (Mon) concepts useful for prediction, (Tues) regression models, and (Wed) classification models. The last class (Thurs) is a prediction contest in which teams apply this material in a competition for the best prediction values.

In this class, I'll discuss concepts that are important for building effective prediction models. Prediction is the data analysis task of determining the value of a response variable for an observation with known predictor values. We'll set aside other data analysis tasks, such as exploring the relationship between the response variable and its predictor variables.

This class consists of four R notebooks in Jupyter. Installation help:

In this class:

  • Introduction: two simple methods but very different approaches
  • Prediction: conceptualizing prediction and the predictors
  • Assessment: assessing predictor performance and the bias-variance tradeoff
  • Example: prediction on an example data set

This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. LLNL-PRES-733567

References for the course

This course is structured around Max Kuhn and Kjell Johnson's master's-level text on predictive modeling. The code is either from the book, especially when using their wonderful caret R package, or code I wrote. Their code is disseminated under their copyright. All data sets used in this course are publicly available.

References

  1. Breiman, Leo. "Statistical modeling: The two cultures." Statistical science 16.3 (2001): 199-231.
  2. Clarke, Bertrand, Ernest Fokoue, and Hao Helen Zhang. Principles and theory for data mining and machine learning. New York: Springer, 2009.
  3. Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning. New York: Springer, 2001.
  4. James, Gareth, et al. An introduction to statistical learning. New York: Springer, 2013.
  5. Kuhn, Max, and Kjell Johnson. Applied predictive modeling. New York: Springer, 2013.
  6. Venables, William N., and Brian D. Ripley. Modern applied statistics with S. New York: Springer, 2002.

Course disclaimer

  • I am not a great R coder. My main goal is functionality. And most of the time it works!
  • I know that technically $\leftarrow$ is the assignment operator, not $=$, but I use both interchangeably.

R Statistical Computing Software

#  http://www.R-project.org
#  Copyright (C) 1995-2012 The R Core Team
#
#  This program is free software; you can redistribute it and/or modify
#  it under the terms of the GNU General Public License as published by
#  the Free Software Foundation; either version 2 of the License, or
#  (at your option) any later version.
#
#  This program is distributed in the hope that it will be useful,
#  but WITHOUT ANY WARRANTY; without even the implied warranty of
#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#  GNU General Public License for more details.
#
#  A copy of the GNU General Public License is available at
#  http://www.r-project.org/Licenses/

Source functions and load libraries

In [1]:
# functions for this page
plotsize <- function(wd,ht){options(repr.plot.width=wd, repr.plot.height=ht)}
packagelist <- c("MASS", "FNN", "ellipse", "AppliedPredictiveModeling", "Hmisc", 
                 "missForest", "caret", "corrplot", "earth", "lattice", "e1071")
output <- sapply(packagelist, require, character.only = TRUE)
if (any(!output)){
    cat('Installing packages: ', packagelist[!output],' ...\n')
    install.packages(packagelist[!output])
    output <- sapply(packagelist[!output], require, character.only = TRUE)
}
# sessionInfo()
Loading required package: MASS
Loading required package: FNN
Loading required package: ellipse
Loading required package: AppliedPredictiveModeling
Loading required package: Hmisc
Loading required package: lattice
Loading required package: survival
Loading required package: Formula
Loading required package: ggplot2

Attaching package: ‘Hmisc’

The following objects are masked from ‘package:base’:

    format.pval, round.POSIXt, trunc.POSIXt, units

Loading required package: missForest
Loading required package: randomForest
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.

Attaching package: ‘randomForest’

The following object is masked from ‘package:Hmisc’:

    combine

The following object is masked from ‘package:ggplot2’:

    margin

Loading required package: foreach
Loading required package: itertools
Loading required package: iterators
Loading required package: caret

Attaching package: ‘caret’

The following object is masked from ‘package:survival’:

    cluster

Loading required package: corrplot
Loading required package: earth
Loading required package: plotmo
Loading required package: plotrix
Loading required package: TeachingDemos

Attaching package: ‘TeachingDemos’

The following objects are masked from ‘package:Hmisc’:

    cnvrt.coords, subplot

Loading required package: e1071

Attaching package: ‘e1071’

The following object is masked from ‘package:Hmisc’:

    impute

Part 1. An introduction

Prediction is a data-centered task that I'll define:

  • We have a collection of observational units, $i = 1, \dots, n$.
  • For each observation, we know the values (or levels) for $p$ variables (predictors) and the value (or class) for a response variable.
  • For observation $i$: collect the predictors into a vector $\mathbf{x}_i = \left[x_{i1}, x_{i2}, \dots, x_{ip}\right]$, and call the response $y_i$.
  • For a new observational unit $\mathbf{x}^*$ with a known $y^*$ value (but pretending not to know it), what is a loss-minimizing guess $\hat{y}$ for its response?
    • for regression, the loss function is usually squared-error: $$L(\hat{y}, y^*) = (\hat{y}-y^*)^2$$
    • for classification, the loss function is usually misclassification error: $$L(\hat{y}, y^*) = \left\{ \begin{array}{cc} 0, & \hat{y}=y^* \\ 1, & \hat{y}\ne y^* \end{array} \right.$$
  • We aim for a prediction method with as low an error (or loss) as possible, ideally zero error, meaning:
    • for regression, our prediction value $\hat{y}$ closely approximates the true $y^*$ (with $\hat{y}=y^*$ always for zero error)
    • for classification, our prediction class $\hat{y}$ is usually the true class $y^*$ (with $\hat{y}=y^*$ always for zero error)
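These two loss functions are easy to write down directly in R (a minimal sketch with illustrative names `sq_loss` and `mis_loss`, not code from the course):

```r
# squared-error loss for regression
sq_loss  <- function(yhat, ystar) (yhat - ystar)^2
# misclassification (0-1) loss for classification
mis_loss <- function(yhat, ystar) as.numeric(yhat != ystar)

sq_loss(3.5, 3.0)     # 0.25
mis_loss("PS", "WS")  # 1
mis_loss("PS", "PS")  # 0
```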
In [2]:
# A toy example of nine observations with two predictors (n=9, p=2)

x1 = 1:9 * 2
set.seed(0)  # if rerun, you get same results due to set.seed(0) setting RNG
x2 = round(rnorm(9,10,5),1)
y = 5 + 0.5 * x1 + round(rnorm(9),1)
example1 = data.frame(y,x1,x2)
for (i in 1:9) rownames(example1)[i] = rawToChar(as.raw(64+i))
    
# create data to be predicted
xt1 = -1:10 * 2 + 1
set.seed(1)  # if rerun, you get same results due to set.seed(1) setting RNG
xt2 = round(rnorm(length(xt1),10,5),1)
truth = 5 + 0.5 * xt1 + round(rnorm(length(xt1)),1)  # we don't know the truth
example1t = data.frame(y=NA,x1=xt1,x2=xt2)  # our data has nothing for the response y
for (i in 1:length(xt1)) rownames(example1t)[i] = rawToChar(as.raw(73+i))
cat('Our data set, observations with y known:')
t(example1)
cat('Observations to be predicted:')
t(example1t)
    
plotsize(8,4)
par(mfrow=c(1,2))
plot(example1$x1, example1$y, pch=rownames(example1), xlim=c(0,20),
     xlab='predictor 1, x1', ylab='response, y', main="plot of y vs. x1")
plot(example1$x2, example1$y, pch=rownames(example1), xlim=c(0,20),
     xlab='predictor 2, x2', ylab='response, y', main="plot of y vs. x2")
Our data set, observations with y known:
        A    B    C    D    E    F    G    H    I
y     8.4  7.8  7.2  7.9  9.7 10.7 11.6 13.3 13.1
x1    2.0  4.0  6.0  8.0 10.0 12.0 14.0 16.0 18.0
x2   16.3  8.4 16.6 16.4 12.1  2.3  5.4  8.5 10.0
Observations to be predicted:
        J    K    L    M    N    O    P    Q    R    S    T    U
y      NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
x1   -1.0  1.0  3.0  5.0  7.0  9.0 11.0 13.0 15.0 17.0 19.0 21.0
x2    6.9 10.9  5.8 18.0 11.6  5.9 12.4 13.7 12.9  8.5 17.6 11.9

Simple approach #1: linear regression

Let's make a function $f(x1,x2)$ to predict $y$ that is a linear fit of $x1$ and $x2$. Then we will use the fit to predict $y$ for observations J through U.

In [3]:
lm.example1 = lm(y~., data=example1)
summary(lm.example1)
Call:
lm(formula = y ~ ., data = example1)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.0090 -0.4786 -0.1343  0.2074  1.6502 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  6.98310    1.50482   4.640  0.00354 **
x1           0.36085    0.07785   4.635  0.00356 **
x2          -0.05859    0.08325  -0.704  0.50795   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.01 on 6 degrees of freedom
Multiple R-squared:  0.8587,	Adjusted R-squared:  0.8116 
F-statistic: 18.23 on 2 and 6 DF,  p-value: 0.002821

Is our linear fit $\hat{y} \equiv f(\mathbf x) = 6.98 + 0.36\, x_1 - 0.06\, x_2$ close enough to the true way $y$ is generated from $x_1$ and $x_2$, which is $y = 5 + 0.5\, x_1$?

Note that beyond the formula $5 + 0.5\, x_1$, the truth includes additive random noise, so it isn't possible to predict the true values $y$ exactly. The best we can hope to do is predict the expected value of $y$ given the predictors $\mathbf{X}$ (that is, the value without the additive noise). We'll see later that the error of our predictions includes an irreducible error term $\sigma^2$ (the variance of the noise added to the $y$ values), a penalty we accept we cannot overcome.
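A quick simulation illustrates this irreducible-error floor (a sketch, not course code; it reuses the generating formula $5 + 0.5\, x_1$ with noise variance $\sigma^2 = 1$):

```r
set.seed(42)
sigma <- 1
x1 <- runif(1e5, 0, 20)
y  <- 5 + 0.5 * x1 + rnorm(length(x1), 0, sigma)
# even the oracle predictor E[y | x1] = 5 + 0.5*x1 cannot beat sigma^2
mean((y - (5 + 0.5 * x1))^2)   # close to sigma^2 = 1
```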

In [4]:
yhat = predict(lm.example1, example1t)
plotsize(8,4)
par(mfrow=c(1,2))
plot(example1$x1, example1$y, pch=rownames(example1), xlim=c(-2,22), ylim=c(3,15),
     xlab='predictor 1, x1', ylab='response, y', main="plot of y vs. x1")
points(example1t$x1, yhat, pch=rownames(example1t), col='red', cex=1)
points(example1t$x1, truth, pch=rownames(example1t), col='orange', cex=.9)
points(example1t$x1, truth, col='orange', cex=2)
plot(example1$x2, example1$y, pch=rownames(example1), xlim=c(-2,22), ylim=c(3,15),
     xlab='predictor 2, x2', ylab='response, y', main="plot of y vs. x2")
points(example1t$x2, yhat, pch=rownames(example1t), col='red', cex=1)
points(example1t$x2, truth, pch=rownames(example1t), col='orange', cex=.9)
points(example1t$x2, truth, col='orange', cex=2)

Comments:

  • The linear fit structure is visible in the $x_1$ plot, but not in the $x_2$ plot. This is because $x_1$ is related to $y$ while $x_2$ is not.
  • The predicted values (red) are "close" to the true but unknown values (orange). One way to assess this closeness is by average absolute error $|y^* - \hat{y}|$.
In [5]:
cat(sprintf("The average absolute error for the linear fit is %.2f.\n", 
            mean(abs(yhat - truth))))
The average absolute error for the linear fit is 1.24.

The trouble with prediction is that we don't know our average absolute error, since we don't know the true values $y^*$ for observations J through U. Without that knowledge, it is easy to read too much into the data and build fancier models... too fancy given the truly simple relationship between $\mathbf{x}$ and $y$.

Ignoring $x_2$, don't the data beg for a polynomial fit of $x_1$ to account for the apparent cubic relationship?

  • what does the subject matter expert (SME) say about a cubic relationship between $x_1$ and $y$?
  • watch for signs that you are overfitting your data
In [6]:
# rather than have two data sets (cumbersome), have a single concatenated data set
#    where the first nine rows are the training data
if (nrow(example1) == 9){
    example1 = rbind(example1, example1t)
    example1$x12 = example1$x1 ^ 2
    example1$x13 = example1$x1 ^ 3
}
cubic.example1 = lm(y ~ x1 + x12 + x13, data=example1[1:9,])
head(example1,3)
summary(cubic.example1)
    y x1   x2 x12 x13
A 8.4  2 16.3   4   8
B 7.8  4  8.4  16  64
C 7.2  6 16.6  36 216
Call:
lm(formula = y ~ x1 + x12 + x13, data = example1[1:9, ])

Residuals:
        A         B         C         D         E         F         G         H 
-0.119192  0.342929 -0.222655 -0.237662  0.376190 -0.002814 -0.396392  0.373737 
        I 
-0.114141 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 10.887302   0.802729  13.563  3.9e-05 ***
x1          -1.556932   0.329115  -4.731  0.00519 ** 
x12          0.198034   0.037252   5.316  0.00315 ** 
x13         -0.005798   0.001229  -4.717  0.00526 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3713 on 5 degrees of freedom
Multiple R-squared:  0.9841,	Adjusted R-squared:  0.9745 
F-statistic:   103 on 3 and 5 DF,  p-value: 6.482e-05

The cubic fit is $\hat{y} = 10.89 - 1.56\, x_1 + 0.198\, x_1^2 - 0.00580\, x_1^3$, yielding the predicted values plotted below.

Demeaning the data matrix

Subtracting each predictor's mean may reduce variability in the coefficient estimates and increase their $t$ values; for this example that holds for the linear coefficient.

Fitting the cubic to a demeaned data matrix changes the $t$ values (and thus the apparent importance) of the linear and squared coefficients, but the fitted curve itself is unchanged:

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  9.323810   0.187636  49.691 6.24e-08 ***
x1           0.664478   0.062771  10.586  0.00013 ***
x12          0.024107   0.005289   4.558  0.00607 ** 
x13         -0.005798   0.001229  -4.717  0.00526 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3713 on 5 degrees of freedom
  (12 observations deleted due to missingness)
Multiple R-squared:  0.9841,    Adjusted R-squared:  0.9745 
F-statistic:   103 on 3 and 5 DF,  p-value: 6.482e-05
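A sketch of how that demeaned fit could be produced (this rebuilds the nine training observations from the earlier cell rather than assuming `example1` is in scope; `x1c` is an illustrative name):

```r
set.seed(0)                       # same draws as the toy-data cell
x1 <- 1:9 * 2
x2 <- round(rnorm(9, 10, 5), 1)
y  <- 5 + 0.5 * x1 + round(rnorm(9), 1)
x1c <- x1 - mean(x1)              # demean x1 before forming powers
fit.c <- lm(y ~ x1c + I(x1c^2) + I(x1c^3))
# the cubic coefficient is unchanged from the un-demeaned fit;
# only the lower-order coefficients (and their t values) shift
coef(fit.c)
```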
In [7]:
istest = is.na(example1$y)
yhat2 = predict(cubic.example1, example1[istest,])
plotsize(8,4)
par(mfrow=c(1,2))
plot(example1$x1[!istest], example1$y[!istest], pch=rownames(example1)[!istest], 
     xlim=c(-2,22), ylim=c(3,15), xlab='predictor 1, x1', ylab='response, y', 
     main="plot of fits vs. x1")
points(example1$x1[istest], yhat2, pch=rownames(example1)[istest], 
       col='red', cex=1)
points(example1$x1[istest], truth, pch=rownames(example1)[istest], 
       col='orange', cex=.9)
points(example1$x1[istest], truth, col='orange', cex=2)
xs = -20:220/10
ys = cbind(1, xs, xs^2, xs^3) %*% cubic.example1$coef
plot(example1$x1[!istest], example1$y[!istest], pch=16, cex=0.75, col='black',
     xlim=c(-2,22), ylim=c(3,15), xlab='predictor 1, x1', ylab='response, y', 
     main=sprintf("average absolute error = %1.2f", mean(abs(yhat2-truth))))
lines(xs, ys, col='red')
for (j in which(istest)){lines(rep(example1$x1[j],2), c(yhat2[j-9],truth[j-9]), 
                               col='blue', lwd=3)}
points(example1$x1[istest], truth, pch=rownames(example1)[istest], 
       col='orange', cex=.9)
points(example1$x1[istest], truth, col='orange', cex=2)

By making the model more complex, the average absolute error increased from 1.24 to 2.10, an increase of 70%. This increased error for the complex model may be counterintuitive.

Of course, not knowing the true values is exactly what makes selecting the most appropriate prediction model difficult! In this course I'll present strategies for effective prediction.

Simple approach #2: nearest neighbor

In contrast to a linear fit of the data, now we'll perform nearest neighbor regression on the above data set. Whereas a predicted value is affected by all data points in a linear fit, the nearest neighbor prediction uses only the closest observation. Specifically, let $\hat{y} = y_i$ for the observation $i$ whose $\mathrm{distance}(\mathbf{x}_i, \mathbf{x}^*)$ is smallest over $i=1,\dots,n$.
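Before using `FNN::knn.reg` below, the rule can be written directly in base R (a sketch; `nn1_predict` is an illustrative name, not course code):

```r
nn1_predict <- function(train_x, train_y, new_x) {
  # for each new row, return the y of the Euclidean-closest training row
  apply(new_x, 1, function(xs) {
    d2 <- rowSums(sweep(train_x, 2, xs)^2)   # squared distances to each row
    train_y[which.min(d2)]
  })
}

tr <- matrix(c(1, 1,  5, 5,  9, 9), ncol = 2, byrow = TRUE)
nn1_predict(tr, c(10, 20, 30), matrix(c(4, 4), ncol = 2))   # 20: (5,5) is closest
```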

In [8]:
example1.1nn <- knn.reg(train=example1[1:9,2:3], test=example1[-(1:9),2:3], 
                        y=example1$y[1:9], k=1)
str(example1.1nn)
List of 7
 $ call     : language knn.reg(train = example1[1:9, 2:3], test = example1[-(1:9), 2:3], y = example1$y[1:9],      k = 1)
 $ k        : num 1
 $ n        : int 12
 $ pred     : num [1:12] 7.8 7.8 7.8 7.2 9.7 10.7 9.7 9.7 13.1 13.3 ...
 $ residuals: NULL
 $ PRESS    : NULL
 $ R2Pred   : NULL
 - attr(*, "class")= chr "knnReg"
In [9]:
istest = is.na(example1$y)
yhat3 = example1.1nn$pred

plotsize(8,4)
par(mfrow=c(1,2))
plot(example1$x1, example1$y, pch=rownames(example1), xlim=c(-2,22), 
     ylim=c(3,15), xlab='predictor 1, x1', ylab='response, y', cex=0.75,
     main="plot of y vs. x1\nnn(red), linear(blue)")
points(example1t$x1, yhat3, pch=rownames(example1t), col='red', cex=1)
points(example1t$x1, truth, pch=rownames(example1t), col='orange', cex=.9)
points(example1t$x1, truth, col='orange', cex=2)
points(example1t$x1, yhat, pch=rownames(example1t), col='blue', cex=0.75)
for (j in which(istest)){lines(rep(example1$x1[j]-.2,2), 
                               c(yhat3[j-9],truth[j-9]), col='red', lwd=4)}
for (j in which(istest)){lines(rep(example1$x1[j]+.2,2), 
                               c(yhat[j-9],truth[j-9]), col='blue', lwd=4)}

plot(example1$x2, example1$y, pch=rownames(example1), xlim=c(-2,22), ylim=c(3,15),
     xlab='predictor 2, x2', ylab='response, y', main="plot of y vs. x2")
points(example1t$x2, yhat3, pch=rownames(example1t), col='red', cex=1)
points(example1t$x2, truth, pch=rownames(example1t), col='orange', cex=.9)
points(example1t$x2, truth, col='orange', cex=2)
      
cat(sprintf("average absolute error for nearest neighbor = %1.2f", 
            mean(abs(yhat3-truth))))
average absolute error for nearest neighbor = 1.47

It's hard to compare performance of these prediction models given this setup, but we'll learn strategies to test which model is likely to generate better prediction values.

No free lunch argument to modeling

No prediction model will outperform another model on every possible data set

  • Each model is ideally suited for a particular relationship between $\mathbf{x}$ and $y$
  • Data models work best when there is an intelligible relationship between the inputs and output, and the system can be characterized with the help of subject matter experts (SMEs)
  • Algorithmic models work best when there is a mountain of data and the relationship between the inputs and output is obscure and well-hidden in the extremely large predictor space (span of $\mathbb{R}^p$)

Therefore, an analyst must consider a broad set of models for a given prediction task, like finding the right tool in your prediction modeling toolbox

An example with real data

In [10]:
plotsize(6,4)
plot(datasets::cars, main='Stopping distance vs. vehicle speed')
In [11]:
t(datasets::cars)
speed  4  4  7  7  8  9 10 10 10 11 ... 20 20 20 22 23 24 24 24  24 25
dist   2 10  4 22 16 10 18 26 34 17 ... 52 56 64 66 54 70 92 93 120 85
In [12]:
xpred = 0:3000/100
speed.lm = lm(dist ~ speed, data=datasets::cars)
ypred.lm = predict(speed.lm, data.frame(speed=xpred))
summary(speed.lm)
Call:
lm(formula = dist ~ speed, data = datasets::cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.069  -9.525  -2.272   9.215  43.201 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -17.5791     6.7584  -2.601   0.0123 *  
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared:  0.6511,	Adjusted R-squared:  0.6438 
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
In [13]:
# average brute (low value) and cover_tree (high value) -- 
#      a quirk of the algorithm for duplicate x values
speed.1nn = knn.reg(train=datasets::cars[,1], test=as.matrix(xpred), 
                    y=datasets::cars[,2], k=1, algorithm = "cover_tree")$pred
speed.1nn = 0.5*(speed.1nn + 
                 knn.reg(train=datasets::cars[,1], test=as.matrix(xpred), 
                         y=datasets::cars[,2], k=1, algorithm = "brute")$pred)
In [14]:
plotsize(8,4)
plot(datasets::cars, main='Stopping distance vs. vehicle speed', 
     xlim=c(1,29), ylim=c(-25,125))
lines(xpred, ypred.lm, col='blue')
lines(xpred, speed.1nn, col='red')
abline(h=0, lty=2)

Some takeaways:

  • nearest neighbor regression always gives a reasonable answer, while the linear model has negative stopping distances for low speeds (perhaps a transform or quadratic fit is required)
  • the linear model is sensible given the relationship between stopping distance and speed, and it appears to fit better in the bulk of the data (speeds 10 through 20)
  • some improvement may be possible with predictor transformations or using more than 1 neighbor (e.g., use the average of two or three neighbors)
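That last idea, averaging a few neighbors rather than using one, can be sketched in base R for a single speed value (illustrative only; `knn.reg` with `k = 3` is the package route, and `knn_avg` is a made-up name):

```r
knn_avg <- function(train_x, train_y, x0, k = 3) {
  d <- abs(train_x - x0)          # one predictor, so distance is |difference|
  mean(train_y[order(d)[1:k]])    # average the k nearest responses
}
# predicted stopping distance at speed 4 from the 3 nearest observations
knn_avg(datasets::cars$speed, datasets::cars$dist, x0 = 4, k = 3)
```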

Part 2. Prediction goals

Three tasks for data analysis are

  1. prediction
  2. variable selection
  3. estimation

All involve this conceptual representation:

$\; \; y \leftarrow \big[$ nature $\big] \leftarrow x$

  1. In variable selection...
  2. In estimation...
  3. In prediction...

Prediction

We conceptualize our data set consisting of $n$ observations of $\mathbf{x}_i = \left[x_{i1}, x_{i2}, \dots, x_{ip}\right]$ each having a measured response $y_i$:

$\; \; y_i \leftarrow \big[$ nature $\big] \leftarrow \mathbf{x}_i , \; i=1,\dots,n$

Our task at hand is to predict what the unknown response $y^*_i$ is for an observed $\mathbf{x}^*_i = \left[x_{i1}^*, x_{i2}^*, \dots, x_{ip}^*\right]$ for $m$ observations $i=n+1,\dots,n+m$ :

$\; \; y_i^* \leftarrow \big[$ nature $\big] \leftarrow \mathbf{x}_i^* , \; i=n+1,\dots,n+m $

One approach to prediction is to characterize the structure of $\mathbf{x}$ and its relationship to $y$:

  1. how nature relates the predictors to the response
  2. use of data models to provide structure
  3. seeing the span of data in the predictor space (what's the population vs. the sample?)
  4. algorithms that make sensible $y$ from $\mathbf{x}$

Some considerations:

  1. how many data are there in the span of $\mathbf{X}$?

    • if the data are sparse in the predictor space $\mathbb{R}^p$, it's easier to make a model "too fancy"
    • if there were more data for the toy data of Part 1, then the appropriateness of a linear or cubic fit would be more apparent
  2. do the observations to be predicted $\mathbf{X}^*$ span the same space as $\mathbf{X}$?

    • prediction performance is sensitive to extrapolation
    • the data points with $x_1 < 4$ in the toy example were more susceptible to error from the cubic fit than points in the center of the distribution of $x_1$ (the 5 to 15 range).
  3. how are the observations $\mathbf{x}_i$ related to one another?

    • we prefer that the observations are independent; if not, can we handle dependencies?
    • if this assumption is violated, we have to be careful how we assess performance by withholding data without dependencies
  4. is it reasonable to assume that the black box is the same for all observations $\mathbf{x}_i$?

    • if the observations are not identically distributed then multiple models may be necessary
    • a situation where this is problematic is if half the data from $$\Big\{ \; y \leftarrow \big[\; \text{nature model 1}\; \big]\leftarrow \mathbf{x} \; \Big\}$$ and half the data from $$\Big\{ \; y \leftarrow \big[\; \text{nature model 2}\; \big]\leftarrow \mathbf{x} \; \Big\}$$

Complications

The task of prediction may seem as simple as (1) choosing a model, (2) plugging in data, and (3) generating prediction values for new data --- but rarely is it that straightforward.

Here are three factors that complicate prediction.

1. Span of data $\mathbf{X}$ is sparse over the predictor space $\mathbb{R}^p$

In [15]:
if (!exists("runnumber")) runnumber = 1
# run first time with n=10, then n=40
# set 'expand' to TRUE to double the density of observations
if (runnumber == 1) expand = FALSE else expand = TRUE
runnumber = 3 - runnumber
set.seed(0)
x <- runif(10)
y <- runif(10)
z <- 0.5 + rt(10,3)/5
data3d <- data.frame(x=x,y=y,z=z)
if (expand == TRUE){
    x <- runif(30)
    y <- runif(30)
    z <- 0.5 + rt(30,3)/5
    datanew <- data.frame(x=x,y=y,z=z)
    data3d <- rbind(data3d, datanew)
}
plotsize(4,4)
cz <- (data3d$z - min(0,min(data3d$z)))/max(1,diff(range(data3d$z)))
cloud(z~x+y,data=data3d,col=rgb(0.1+0.8*cz,0.25+0.5*cz,0.9-0.8*cz), cex=1, 
      pch=16, xlim=c(0,1), ylim=c(0,1), zlim=c(0,1))
pred <- data.frame(x=rep(1:19/20, each=19),y=rep(1:19/20,19))
data3d.lm <- lm(z ~ ., data=data3d)
pred.fit <- predict(data3d.lm, pred)
cz <- (pred.fit - 0.1)/0.8
cloud(pred.fit ~ pred$x + pred$y, col=rgb(0.1+0.8*cz,0.25+0.5*cz,0.9-0.8*cz), 
      cex=0.5, xlim=c(0,1), ylim=c(0,1), zlim=c(-.2,1.2))
In [16]:
data3d[1:10,]
cor(data3d[1:10,])
cor(data3d)
           x          y         z
1  0.8966972 0.06178627 0.8114786
2  0.2655087 0.20597457 0.3698626
3  0.3721239 0.17655675 0.6115117
4  0.5728534 0.68702285 0.6551002
5  0.9082078 0.38410372 0.6618199
6  0.2016819 0.76984142 0.5115418
7  0.8983897 0.49769924 0.9063388
8  0.9446753 0.71761851 0.4031223
9  0.6607978 0.99190609 0.8508381
10 0.6291140 0.38003518 0.3833136

             x            y         z
x  1.000000000 -0.001385688 0.4342481
y -0.001385688  1.000000000 0.1097172
z  0.434248073  0.109717168 1.0000000

             x            y         z
x  1.000000000 -0.001385688 0.4342481
y -0.001385688  1.000000000 0.1097172
z  0.434248073  0.109717168 1.0000000

With only 10 points in this cube, it's hard to avoid seeing a relationship between $z$ and the predictors $x$ and $y$, even though none exists. As we gather more observations, the absence of any relationship becomes easier to see.

In [17]:
n=160   # 16x the density of the n=10 case (4x the expanded n=40 case)
set.seed(0)
x <- runif(n)
y <- runif(n)
#z <- 1+0.5*x+0.5*y+rnorm(n)
z <- 0.5 + rnorm(n)/4
data3d <- data.frame(x=x,y=y,z=z)
plotsize(4,4)
cz <- (data3d$z - min(0,min(data3d$z)))/max(1,diff(range(data3d$z)))
cloud(z~x+y,data=data3d,col=rgb(0.1+0.8*cz,0.25+0.5*cz,0.9-0.8*cz), cex=1, 
      pch=16, xlim=c(0,1), ylim=c(0,1), zlim=c(0,1))
pred <- data.frame(x=rep(1:19/20, each=19),y=rep(1:19/20,19))
data3d.lm <- lm(z ~ ., data=data3d)
pred.fit <- predict(data3d.lm, pred)
cz <- (pred.fit - 0.1)/0.8
cloud(pred.fit ~ pred$x + pred$y, col=rgb(0.1+0.8*cz,0.25+0.5*cz,0.9-0.8*cz), 
      cex=0.5, xlim=c(0,1), ylim=c(0,1), zlim=c(-.2,1.2))

Lesson: more data coverage over the predictor space allows more complex models

2. Curse of dimensionality

Question 1: Say a predictor $x$ takes values uniformly between 0 and 1. If we want to make a prediction at $x=0.5$ and use all data within 0.1 away (between 0.4 and 0.6), what proportion of the data is available for making our prediction value?

In [18]:
n = 1000
set.seed(0)
x = runif(n)
sum(abs(x - 0.5) < .1)/n
0.2

Let's add predictor $u$ also taking values uniformly between 0 and 1. If we want to make a prediction at $(x,u)=(0.5,0.5)$ and use all data within Euclidean distance 0.1 from this point, now what proportion of the data is available?

In [19]:
n = 1000
set.seed(0)
x = matrix(runif(2*n),n,2)
sum(sqrt(rowSums((x - c(0.5,0.5))^2)) < .1)/n
0.029

The exact answer is $\pi(0.1)^2 \approx 0.0314$. If we keep adding dimensions, what would this proportion be for a variety of distances from the center of the unit cube?
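In general, the proportion captured within distance $d$ of the center in $p$ dimensions (so long as the ball of radius $d$ fits inside the unit cube) is the $p$-ball volume $\pi^{p/2} d^{\,p} / \Gamma(p/2+1)$. A quick sketch checking this against the answers above (`ball_frac` is an illustrative name):

```r
# volume of a p-dimensional ball of radius d (= proportion of the unit cube,
# when the ball fits inside it)
ball_frac <- function(p, d) pi^(p / 2) * d^p / gamma(p / 2 + 1)
ball_frac(1, 0.1)    # 0.2, the one-predictor answer
ball_frac(2, 0.1)    # 0.03141593, i.e. pi * 0.1^2
ball_frac(10, 0.5)   # ~0.0025: even d = 0.5 captures little in 10 dimensions
```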

In [20]:
distlist = c(0.1, 0.25, 0.50, 0.75, 1)
resmat = matrix(NA, 20, length(distlist))
n = 1e6
for (p in 1:nrow(resmat)){
    set.seed(p)
    x = matrix(runif(p*n),n,p)
    euclid_dist <- sqrt(rowSums((x - rep(0.5,p))^2))
    for (d in 1:ncol(resmat))
        resmat[p,d] = sum(euclid_dist < distlist[d])/n    
}
In [21]:
plotsize(7,4)
plot(1:20, resmat[,1], type='l', ylim=c(0,1), ylab='Proportion within distance d', 
     xlab='number of predictors on (0,1)', las=1, lwd=2)
for (d in 2:ncol(resmat))
    lines(1:20, resmat[,d], col=d+(d>4), lwd=2)
legend(c(15,20),c(.3,1),c("d=0.1","d=0.25","d=0.5","d=0.75","d=1"), 
       lty=1, col=c(1:4,6), lwd=2, cex=1.2, y.intersp=2.5)
title('Data near cube center becomes sparse as dimension increases')

Lesson: as dimension of predictor space $\mathbb{R}^p$ increases, it's harder for the data to adequately cover the space (so overfitting is easy to do)

3. Span of data $\mathbf{X}^*$ does not align with $\mathbf{X}$ on $\mathbb{R}^p$

  • Predicting winter clothing prices or sales may not be too successful if using only summer clothes data.

  • How accurate will prediction of 2018 MLB player statistics be using only 2017 data? The last five years? The last 50 years? What do you think would yield the best prediction model?

  • Assessment techniques that we'll learn will not rescue us from this pitfall. Performance validation assesses how well we will predict observations from the same system (the same $\mathbf{X}$ predictor space).

  • It may be better to think of ways future data could be different and test sensitivity to those differences that aren't in the prediction model.

Lesson: be careful to check that the new data to be predicted is similar to the data used for training the prediction model
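One cheap screen for this pitfall is comparing per-predictor ranges of the new data against the training data (a sketch; `range_check` is an illustrative name, not a caret function):

```r
range_check <- function(train, test) {
  # TRUE for any predictor where the new data extends beyond the training range
  sapply(names(train), function(v)
    min(test[[v]]) < min(train[[v]]) || max(test[[v]]) > max(train[[v]]))
}
tr <- data.frame(x1 = 2:18,      x2 = seq(0, 20, length.out = 17))
te <- data.frame(x1 = c(-1, 21), x2 = c(5, 10))
range_check(tr, te)   # x1 TRUE (extrapolation risk), x2 FALSE
```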

Data pre-processing with preProcess

In [22]:
################################################################################
### R code from Applied Predictive Modeling (2013) by Kuhn and Johnson.
### Copyright 2013 Kuhn and Johnson
### Web Page: http://www.appliedpredictivemodeling.com
### Contact: Max Kuhn (mxkuhn@gmail.com)
###
### Chapter 3: Data Pre-Processing
###
### Required packages: AppliedPredictiveModeling, e1071, caret, corrplot
###
### Data used: The (unprocessed) cell segmentation data from the
###            AppliedPredictiveModeling package.
###
################################################################################

################################################################################
### Section 3.1 Case Study: Cell Segmentation in High-Content Screening

library(AppliedPredictiveModeling)
data(segmentationOriginal)

## Retain the original training set
segTrain <- subset(segmentationOriginal, Case == "Train")

## Remove the first three columns (identifier columns)
segTrainX <- segTrain[, -(1:3)]
segTrainClass <- segTrain$Class

dim(segTrain)
head(segTrain,10)
  1009  119
(output of head(segTrain, 10): the first 10 rows of the 1009 x 119 segmentation
training data, with identifier columns Cell, Case, Class followed by image-feature
predictors such as AngleCh1, AreaCh1, AvgIntenCh1, ..., WidthCh1, XCentroid,
YCentroid; the table is too wide to reproduce here)
In [23]:
# (C) Kuhn and Johnson, 2013
################################################################################
### Section 3.2 Data Transformations for Individual Predictors

## The column VarIntenCh3 measures the standard deviation of the intensity
## of the pixels in the actin filaments

max(segTrainX$VarIntenCh3)/min(segTrainX$VarIntenCh3)
870.887247202783
In [24]:
skewness(segTrainX$VarIntenCh3)
2.39162433392159
In [25]:
set.seed(0)
nv = rnorm(200)
skewness(nv)
skewness(exp(nv))
plotsize(8,3)
par(mfrow=c(1,3))
hist(nv, main='Histogram of Gaussian draws')
hist(exp(nv),12, main='exp(Gaussian) draws')
hist(segTrainX$VarIntenCh3)
0.462323847139098
3.3591015411372
In [26]:
skewvals <- sapply(segTrainX, skewness)
plotsize(4,3)
hist(skewvals,12)
In [27]:
a = which.min(skewvals); a
b = which.max(skewvals); b
ConvexHullPerimRatioCh1: 15
KurtIntenCh1: 65
In [28]:
plotsize(8,4)
par(mfrow=c(1,2))
hist(segTrainX[,a], 20, main=sprintf('%s skew=%.2f', names(a), skewvals[a]))
hist(segTrainX[,b], 20, main=sprintf('%s skew=%.2f', names(b), skewvals[b]))

Some modeling techniques work better when the predictors are unskewed (e.g., approximately Gaussian). Even more important is that the distribution of the response variable be reasonable for the chosen prediction model.

In [29]:
# (C) Kuhn and Johnson, 2013
## Use caret's preProcess function to transform for skewness
segPP <- preProcess(segTrainX, method = "BoxCox")    # caret::preProcess

## Apply the transformations
segTrainTrans <- predict(segPP, segTrainX)
In [30]:
#str(segPP)   # segPP is a collection of transformations (Output is long!)
In [31]:
## Results for a single predictor
segPP$bc$VarIntenCh3
Box-Cox Transformation

1009 data points used to estimate Lambda

Input data summary:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
  0.8693  37.0600  68.1300 101.7000 125.0000 757.0000 

Largest/Smallest: 871 
Sample Skewness: 2.39 

Estimated Lambda: 0.1 
With fudge factor, Lambda = 0 will be used for transformations
In [32]:
preProcess(data.frame(a=segTrainX$KurtIntenCh1), method="BoxCox")
Created from 1009 samples and 0 variables

Pre-processing:
  - ignored (0)
Note: the Box-Cox transformation requires strictly positive data, so no transformation is estimated here — presumably because KurtIntenCh1 contains non-positive values.
In [33]:
# (C) Kuhn and Johnson, 2013
plotsize(3,3)
histogram(~segTrainX$VarIntenCh3,
          xlab = "Natural Units",
          type = "count")
histogram(~log(segTrainX$VarIntenCh3),
          xlab = "Log Units",
          ylab = " ",
          type = "count")
In [34]:
segPP$bc$PerimCh1
histogram(~segTrainX$PerimCh1,
          xlab = "Natural Units",
          type = "count")
histogram(~segTrainTrans$PerimCh1,
          xlab = "Transformed Data",
          ylab = " ",
          type = "count")
Box-Cox Transformation

1009 data points used to estimate Lambda

Input data summary:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  47.74   64.37   79.02   91.61  103.20  459.80 

Largest/Smallest: 9.63 
Sample Skewness: 2.59 

Estimated Lambda: -1.1 

Standardization

Standardizing a predictor centers the variable so that its mean is 0 and then scales it to have variance 1.

Standardization is important because some prediction methods require the matrix columns to have zero mean and unit variance.
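Base R's `scale` function performs the same centering and scaling as caret's preProcess; a minimal sketch on simulated data:

```r
# Base-R centering and scaling with scale() (a sketch; simulated data):
set.seed(1)
x  <- matrix(rnorm(50, mean = 10, sd = 3), ncol = 2)
xs <- scale(x, center = TRUE, scale = TRUE)  # subtract column means, divide by column sds
round(colMeans(xs), 10)   # each column now has mean 0
apply(xs, 2, sd)          # and standard deviation 1
```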

In [35]:
set.seed(0); n=100
mat = data.frame(x1=rnorm(n,50,10), x2=rexp(n,0.04))
mat$y = 0.5 * mat$x1 + 0.25 * mat$x2 + rnorm(n,0,10)
In [36]:
head(mat)
        x1         x2         y
  62.62954 14.1425668 40.060641
  46.73767 19.0795002 26.551162
  63.29799 39.3179176 56.124349
  62.72429 45.8910245 35.174083
  54.14641  0.9288069 23.003291
  34.60050  3.1391468  8.823942
In [37]:
transf = preProcess(mat, method="center")
#transf = preProcess(mat, method=c("center","scale"))
#transf = preProcess(mat, method=c("BoxCox"))
#transf = preProcess(mat, method=c("BoxCox","center","scale"))
In [38]:
transf
Created from 100 samples and 3 variables

Pre-processing:
  - centered (3)
  - ignored (0)
In [39]:
mat2 = predict(transf, mat)
In [40]:
plotsize(8,6)
par(mfrow=c(3,2))
for (i in 1:3){ 
    hist(mat[,i], main=sprintf('Original variable %s', names(mat)[i]))
    hist(mat2[,i], main=sprintf('Transformed variable %s', names(mat)[i]))
}

Data reduction using PCA

In [41]:
head(segTrainTrans)
(Output: the first 6 rows of segTrainTrans, the Box-Cox-transformed training predictors, columns AngleCh1 through YCentroid.)
In [42]:
rbind(apply(segTrainTrans, 2, mean), apply(segTrainTrans, 2, sd))
(Output: per-column means, row 1, and standard deviations, row 2, of segTrainTrans — the columns are still on very different scales.)
In [43]:
# (C) Kuhn and Johnson, 2013
################################################################################
### Section 3.3 Data Transformations for Multiple Predictors

## R's prcomp is used to conduct PCA
pr <- prcomp(~ AvgIntenCh1 + EntropyIntenCh1, 
             data = segTrainTrans, 
             scale. = TRUE)
# Note: segTrainTrans is the transformed data

transparentTheme(pchSize = .7, trans = .3)

plotsize(6,4)
xyplot(AvgIntenCh1 ~ EntropyIntenCh1,
       data = segTrainTrans,
       groups = segTrain$Class,
       xlab = "Intensity Entropy Channel 1",
       ylab = "Channel 1 Average Intensity",
       auto.key = list(columns = 2),
       type = c("p", "g"),
       main = "Original Data: 2 variable PCA",
       aspect = 1)
In [44]:
# (C) Kuhn and Johnson, 2013
xyplot(PC2 ~ PC1,
       data = as.data.frame(pr$x),
       groups = segTrain$Class,
       xlab = "Principal Component #1",
       ylab = "Principal Component #2",
       main = "Transformed: 2 variable PCA",
       xlim = extendrange(pr$x),
       ylim = extendrange(pr$x),
       type = c("p", "g"),
       aspect = 1)
In [45]:
## Apply PCA to the entire set of predictors.

# Note: this is a lot of predictors for PCA
dim(segTrainX)
  1. 1009
  2. 116
In [46]:
head(apply(segTrainX, 2, sd))
AngleCh1
48.9119002110842
AngleStatusCh1
0.788215356967476
AreaCh1
216.555275887585
AreaStatusCh1
0.273382646927243
AvgIntenCh1
164.013578649018
AvgIntenCh2
153.98661617808
In [47]:
## There are a few predictors with only a single value, so we remove these first
## (since PCA uses variances, which would be zero)

isZV <- apply(segTrainX, 2, function(x) length(unique(x)) == 1)
segTrainX <- segTrainX[, !isZV]

segPP <- preProcess(segTrainX, c("BoxCox", "center", "scale"))
segTrainTrans <- predict(segPP, segTrainX)
# now the data are centered and scaled
rbind(apply(segTrainTrans, 2, mean), apply(segTrainTrans, 2, sd))
(Output: after centering and scaling, every column mean is numerically zero, on the order of 1e-16, and every column standard deviation is exactly 1.)
In [48]:
# (C) Kuhn and Johnson, 2013
segPCA <- prcomp(segTrainTrans, center = TRUE, scale. = TRUE)

## Plot a scatterplot matrix of the first three components
transparentTheme(pchSize = .8, trans = .3)

panelRange <- extendrange(segPCA$x[, 1:3])
plotsize(6,6)
splom(as.data.frame(segPCA$x[, 1:3]),
      groups = segTrainClass,
      type = c("p", "g"),
      as.table = TRUE,
      auto.key = list(columns = 2),
      prepanel.limits = function(x) panelRange)

Even with 114 predictors (116 less the two zero-variance columns), it's easy to see how the first three principal components help separate the data into the PS and WS classes.

In [49]:
splom(as.data.frame(segPCA$x[, c(1,4,5)]),
      groups = segTrainClass,
      type = c("p", "g"),
      as.table = TRUE,
      auto.key = list(columns = 2),
      prepanel.limits = function(x) panelRange)

It's not as easy to see how PC4 and PC5 help differentiate the classes. Still, note the strong blue shapes in the PC1 vs. PC4 and PC1 vs. PC5 panels, meaning PC4 and PC5 do offer some help in discriminating WS from PS.

In [50]:
str(segPCA)    # Note that center and scale must be 0 and 1 for PCA
List of 5
 $ sdev    : num [1:114] 3.99 3.79 3.28 2.7 2.12 ...
 $ rotation: num [1:114, 1:114] 0.0046 -0.0188 0.0776 0.0841 -0.182 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:114] "AngleCh1" "AngleStatusCh1" "AreaCh1" "AreaStatusCh1" ...
  .. ..$ : chr [1:114] "PC1" "PC2" "PC3" "PC4" ...
 $ center  : Named num [1:114] 4.70e-17 -6.43e-17 3.88e-14 -2.11e-17 -6.09e-16 ...
  ..- attr(*, "names")= chr [1:114] "AngleCh1" "AngleStatusCh1" "AreaCh1" "AreaStatusCh1" ...
 $ scale   : Named num [1:114] 1 1 1 1 1 1 1 1 1 1 ...
  ..- attr(*, "names")= chr [1:114] "AngleCh1" "AngleStatusCh1" "AreaCh1" "AreaStatusCh1" ...
 $ x       : num [1:1009, 1:114] 4.356 -0.544 3.551 -0.432 0.483 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:1009] "2" "3" "4" "12" ...
  .. ..$ : chr [1:114] "PC1" "PC2" "PC3" "PC4" ...
 - attr(*, "class")= chr "prcomp"
In [51]:
plotsize(6,3)
varexpl = (segPCA$sdev[1:30]^2)/sum(segPCA$sdev^2)
plot(varexpl, ylim=c(0,max(varexpl)),
     cex=.75, type='l', ylab='Percent of Total Variance',
     main="Scree plot: explained variance percentage")
points(varexpl, pch=16, cex=.75)
abline(h=0, v=5.5, lty=2)

After PC4, no other principal component explains more than four percent of the total variance.

In [52]:
plotsize(5,5)
splom(as.data.frame(segPCA$x[, c(1,6)]),
      groups = segTrainClass,
      type = c("p", "g"),
      as.table = TRUE,
      auto.key = list(columns = 2),
      prepanel.limits = function(x) panelRange)
In [53]:
isx = (segPCA$x[,1] > 0 & segPCA$x[,1] < 1.5)
table(segTrainClass[isx])
isx2 = (segPCA$x[,1] > 0 & segPCA$x[,1] < 1.5 & 
        segPCA$x[,6] > 1 & segPCA$x[,6] < 1.5)
table(segTrainClass[isx2])
 PS  WS 
109  37 
PS WS 
 8 12 

PC6 is still informative, as shown above. Without PC6, classifying PS vs. WS for values of PC1 between 0 and 1.5 would have a misclassification rate of 37/146 = 25.3% (calling everything PS). Using PC6 to instead call the 20 points with PC6 between 1 and 1.5 WS, we improve the misclassification rate in this range to 33/146 = 22.6%. The other principal components make small contributions to improving the overall classifier.

In [54]:
# (C) Kuhn and Johnson, 2013
## Format the rotation values for plotting
segRot <- as.data.frame(segPCA$rotation[, 1:3])

## Derive the channel variable
vars <- rownames(segPCA$rotation)
channel <- rep(NA, length(vars))
channel[grepl("Ch1$", vars)] <- "Channel 1"
channel[grepl("Ch2$", vars)] <- "Channel 2"
channel[grepl("Ch3$", vars)] <- "Channel 3"
channel[grepl("Ch4$", vars)] <- "Channel 4"
In [55]:
# to understand the previous code:
head(vars)
head(grepl("Ch1$", vars))
  1. 'AngleCh1'
  2. 'AngleStatusCh1'
  3. 'AreaCh1'
  4. 'AreaStatusCh1'
  5. 'AvgIntenCh1'
  6. 'AvgIntenCh2'
  1. TRUE
  2. TRUE
  3. TRUE
  4. TRUE
  5. TRUE
  6. FALSE
In [56]:
head(segRot)
                        PC1          PC2          PC3
AngleCh1        0.004595073 -0.006922644  0.001496270
AngleStatusCh1 -0.018808055  0.001886987 -0.002674549
AreaCh1         0.077588836  0.217192292  0.086041657
AreaStatusCh1   0.084107801  0.158731069  0.057593826
AvgIntenCh1    -0.181988868  0.053447704  0.016344246
AvgIntenCh2    -0.183982077  0.035323460  0.031343661
In [57]:
segRot$Channel <- channel
head(segRot)
                        PC1          PC2          PC3   Channel
AngleCh1        0.004595073 -0.006922644  0.001496270 Channel 1
AngleStatusCh1 -0.018808055  0.001886987 -0.002674549 Channel 1
AreaCh1         0.077588836  0.217192292  0.086041657 Channel 1
AreaStatusCh1   0.084107801  0.158731069  0.057593826 Channel 1
AvgIntenCh1    -0.181988868  0.053447704  0.016344246 Channel 1
AvgIntenCh2    -0.183982077  0.035323460  0.031343661 Channel 2
In [58]:
complete.cases(segRot)
tail(segRot)
  1. TRUE
  ⋮ (entries 2–112 are likewise TRUE)
  113. FALSE
  114. FALSE
                           PC1          PC2         PC3   Channel
VarIntenStatusCh3 -0.017894416 -0.039943703  0.12350926 Channel 3
VarIntenStatusCh4  0.054587197 -0.074589783 -0.03461269 Channel 4
WidthCh1           0.040588348  0.196379600  0.06061612 Channel 1
WidthStatusCh1     0.062667028  0.001511114 -0.02338266 Channel 1
XCentroid          0.017941153 -0.012457793 -0.01618352        NA
YCentroid          0.005085324 -0.036591931 -0.03760040        NA
In [59]:
segRot <- segRot[complete.cases(segRot),]
segRot$Channel <- factor(as.character(segRot$Channel))
In [60]:
# (C) Kuhn and Johnson, 2013
## Plot a scatterplot matrix of the first three rotation variables

transparentTheme(pchSize = .8, trans = .7)
panelRange <- extendrange(segRot[, 1:3])
library(ellipse)
upperp <- function(...)
  {
    args <- list(...)
    circ1 <- ellipse(diag(rep(1, 2)), t = .1)
    panel.xyplot(circ1[,1], circ1[,2],
                 type = "l",
                 lty = trellis.par.get("reference.line")$lty,
                 col = trellis.par.get("reference.line")$col,
                 lwd = trellis.par.get("reference.line")$lwd)
    circ2 <- ellipse(diag(rep(1, 2)), t = .2)
    panel.xyplot(circ2[,1], circ2[,2],
                 type = "l",
                 lty = trellis.par.get("reference.line")$lty,
                 col = trellis.par.get("reference.line")$col,
                 lwd = trellis.par.get("reference.line")$lwd)
    circ3 <- ellipse(diag(rep(1, 2)), t = .3)
    panel.xyplot(circ3[,1], circ3[,2],
                 type = "l",
                 lty = trellis.par.get("reference.line")$lty,
                 col = trellis.par.get("reference.line")$col,
                 lwd = trellis.par.get("reference.line")$lwd)
    panel.xyplot(args$x, args$y, groups = args$groups, 
                 subscripts = args$subscripts)
  }
splom(~segRot[, 1:3],
      groups = segRot$Channel,
      lower.panel = function(...){}, upper.panel = upperp,
      prepanel.limits = function(x) panelRange,
      auto.key = list(columns = 2))

Data imputation: filling in missing data

Many methods cannot handle missing data and omit those observations from portions of their calculations. When a data set has many predictors and only one item is missing (NA), it doesn't make sense to ignore all the other information that observation can provide. Therefore we can impute those values using the mean or nearby observations.

The Hmisc package's aregImpute function, used below, performs multivariate imputation. There are several comparable imputation packages for R.

See Alzola CF, Harrell FE (2004): An Introduction to S and the Hmisc and Design Libraries at http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf for extensive documentation and examples for the Hmisc package.
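As a baseline for comparison with model-based imputation, the simplest approach replaces each missing value with its column mean; a minimal base-R sketch on toy data:

```r
# Mean imputation, the simplest baseline (a sketch; toy data):
set.seed(2)
dmis <- data.frame(x1 = rnorm(20), x2 = rnorm(20))
dmis$x1[c(3, 8)] <- NA                     # introduce missing values
col_means <- colMeans(dmis, na.rm = TRUE)  # means over the observed values only
for (j in seq_along(dmis)) {
    dmis[is.na(dmis[, j]), j] <- col_means[j]
}
sum(is.na(dmis))                           # 0: no missing values remain
```

Mean imputation ignores the relationships among predictors, which is exactly what multivariate methods such as aregImpute exploit.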

In [61]:
colSums(is.na(segRot))
PC1
0
PC2
0
PC3
0
Channel
0
In [62]:
set.seed(0)
segRot.mis <- prodNA(segRot, noNA=0.1)
is.mis <- which(rowSums(is.na(segRot.mis)) > 0)
head(segRot.mis[is.mis,],10)
                               PC1          PC2          PC3   Channel
AngleStatusCh1         -0.01880805           NA -0.002674549 Channel 1
AvgIntenCh2                     NA  0.035323460  0.031343661        NA
AvgIntenCh3            -0.06613143           NA -0.204798182 Channel 3
EntropyIntenStatusCh1   0.02031754 -0.008497681           NA Channel 1
EntropyIntenStatusCh4           NA -0.097077507  0.021364260 Channel 4
EqCircDiamCh1           0.07837945  0.217856721  0.086038312        NA
EqEllipseLWRCh1         0.14124378           NA           NA Channel 1
EqSphereVolCh1          0.07759638  0.217234697           NA Channel 1
FiberLengthCh1                  NA  0.132171828  0.001572490 Channel 1
IntenCoocASMCh3        -0.07894645           NA  0.259177478 Channel 3
In [63]:
rbind(colMeans(segRot.mis[,1:3],na.rm=T), apply(segRot.mis[,1:3],2,sd,na.rm=T))
        PC1        PC2         PC3
 0.01069827 0.04088719 0.009689616
 0.09324863 0.08844576 0.096122299
In [64]:
DImodel <- aregImpute(~ PC1+PC2+PC3, data=segRot.mis)
Iteration 8 
In [65]:
segRot.impute = segRot.mis
segRot.impute[is.na(segRot.mis[,1]),1] = rowMeans(DImodel$imputed$PC1)
segRot.impute[is.na(segRot.mis[,2]),2] = rowMeans(DImodel$imputed$PC2)
segRot.impute[is.na(segRot.mis[,3]),3] = rowMeans(DImodel$imputed$PC3)
head(segRot.impute[is.mis,],10)
                                PC1          PC2          PC3   Channel
AngleStatusCh1         -0.018808055  0.070322644 -0.002674549 Channel 1
AvgIntenCh2            -0.004395498  0.035323460  0.031343661        NA
AvgIntenCh3            -0.066131428  0.005047225 -0.204798182 Channel 3
EntropyIntenStatusCh1   0.020317544 -0.008497681  0.006630911 Channel 1
EntropyIntenStatusCh4   0.046664566 -0.097077507  0.021364260 Channel 4
EqCircDiamCh1           0.078379454  0.217856721  0.086038312        NA
EqEllipseLWRCh1         0.141243785  0.048522622 -0.020572800 Channel 1
EqSphereVolCh1          0.077596383  0.217234697  0.009157835 Channel 1
FiberLengthCh1          0.034362508  0.132171828  0.001572490 Channel 1
IntenCoocASMCh3        -0.078946455  0.009169956  0.259177478 Channel 3
In [66]:
plotsize(8,3)
par(mfrow=c(1,3))
for (j in 1:3){
    isx = is.na(segRot.mis[,j])
    plot(segRot.impute[isx,j], segRot[isx,j], asp=1)
    abline(0,1,col='red',lty=2)
}
In [67]:
DImodel2 <- aregImpute(Channel ~ PC1 + PC2 + PC3, data=segRot.impute)
Warning message in aregImpute(Channel ~ PC1 + PC2 + PC3, data = segRot.impute):
“Channel has the following levels with < 5 observations: Channel 2 
Consider using the group parameter to balance bootstrap samples”
Iteration 5 
In [68]:
str(DImodel2$cat.levels)
List of 4
 $ Channel: chr [1:4] "Channel 1" "Channel 2" "Channel 3" "Channel 4"
 $ PC1    : NULL
 $ PC2    : NULL
 $ PC3    : NULL
In [69]:
cbind(DImodel2$imputed$Channel, segRot$Channel[is.na(segRot.mis$Channel)])
                       imputed draws (5)   truth
AvgIntenCh2                1 1 1 1 4         2
EqCircDiamCh1              2 4 1 1 1         1
IntenCoocMaxCh4            4 2 3 2 4         4
IntenCoocMaxStatusCh4      1 4 3 4 3         4
KurtIntenCh3               3 3 3 3 3         3
KurtIntenStatusCh1         1 3 1 1 1         1
ShapeBFRCh1                1 3 4 1 1         1
SkewIntenStatusCh3         1 1 4 1 1         3
TotalIntenStatusCh1        1 1 1 3 1         1
TotalIntenStatusCh3        1 1 1 1 1         3

Given the limited data set size, the imputation isn't bad: for 7 of the 10 missing values, the true Channel class receives at least as many imputed draws as any other class.

Removing correlated variables

Correlated variables cause a variety of problems for prediction methods. For example, collinearity (correlation near $\pm 1$) causes linear fit coefficients to be highly variable, and it degrades the meaning of variable importance in a random forest. The following functions remove correlated variables from the data in preparation for analysis.
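To see the coefficient-variability problem concretely, here is a small simulation (not from the course materials) comparing the standard error of a regression coefficient with and without a nearly collinear copy of its predictor:

```r
# Collinearity inflates coefficient standard errors (a sketch; simulated data):
set.seed(3)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # nearly collinear with x1 (correlation ~ 0.999)
y  <- 2 * x1 + rnorm(n)
se_alone <- summary(lm(y ~ x1))$coefficients["x1", "Std. Error"]
se_coll  <- summary(lm(y ~ x1 + x2))$coefficients["x1", "Std. Error"]
c(alone = se_alone, collinear = se_coll)   # the collinear SE is many times larger
```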

In [70]:
# (C) Kuhn and Johnson, 2013
################################################################################
### Section 3.5 Removing Variables

## To filter on correlations, we first get the correlation matrix for the 
## predictor set

segCorr <- cor(segTrainTrans)

plotsize(8,8)
corrplot(segCorr, order = "hclust", tl.cex = .35)
In [71]:
# (C) Kuhn and Johnson, 2013
## caret's findCorrelation function is used to identify columns to remove.
highCorr <- findCorrelation(segCorr, .75)
highCorr
  1. 29
  2. 35
  3. 37
  4. 38
  5. 39
  6. 40
  7. 45
  8. 49
  9. 57
  10. 58
  11. 61
  12. 62
  13. 71
  14. 79
  15. 85
  16. 87
  17. 88
  18. 89
  19. 97
  20. 99
  21. 100
  22. 102
  23. 105
  24. 108
  25. 13
  26. 5
  27. 7
  28. 8
  29. 17
  30. 19
  31. 3
  32. 4
  33. 34
  34. 25
  35. 50
  36. 51
  37. 52
  38. 73
  39. 31
  40. 6
  41. 9
  42. 18
  43. 22
In [72]:
segCorr2 <- segCorr[-highCorr,-highCorr]
plotsize(6,6)
corrplot(segCorr2, order = "hclust", tl.cex = .35)

Creating dummy variables

Factor variables (categorical levels) should be turned into indicator variables (1/0 values based on level membership).
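Base R's `model.matrix` builds the same kind of indicator encoding as caret's dummyVars used below; a minimal sketch on toy data:

```r
# Dummy variables with base R's model.matrix (a sketch; toy data):
toy <- data.frame(mileage = c(20200, 24323, 8507),
                  type = factor(c("hatchback", "sedan", "coupe")))
mm_full_rank <- model.matrix(~ mileage + type, data = toy)      # drops the first level
mm_all_lvls  <- model.matrix(~ mileage + type - 1, data = toy)  # one column per level
colnames(mm_full_rank)
colnames(mm_all_lvls)
```

The default encoding is full rank (one level is absorbed into the intercept); removing the intercept with `- 1` gives the less-than-full-rank, one-column-per-level encoding that dummyVars produces.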

In [73]:
# (C) Kuhn and Johnson, 2013
################################################################################
### Section 3.8 Computing (Creating Dummy Variables)

data(cars)
type <- c("convertible", "coupe", "hatchback", "sedan", "wagon")
cars$Type <- factor(apply(cars[, 14:18], 1, function(x) type[which(x == 1)]))
head(cars)  
    # Note that the car's make has already been converted into dummy variables 
    #     (Buick, Cadillac, etc.)
Price Mileage Cylinder Doors Cruise Sound Leather Buick Cadillac Chevy Pontiac Saab Saturn convertible coupe hatchback sedan wagon Type
22661.05 20105 6 4 1 0 0 1 0 0 0 0 0 0 0 0 1 0 sedan
21725.01 13457 6 2 1 1 0 0 0 1 0 0 0 0 1 0 0 0 coupe
29142.71 31655 4 2 1 1 1 0 0 0 0 1 0 1 0 0 0 0 convertible
30731.94 22479 4 2 1 0 0 0 0 0 0 1 0 1 0 0 0 0 convertible
33358.77 17590 4 2 1 1 1 0 0 0 0 1 0 1 0 0 0 0 convertible
30315.17 23635 4 2 1 0 0 0 0 0 0 1 0 1 0 0 0 0 convertible
In [74]:
set.seed(0)
carSubset <- cars[sample(1:nrow(cars), 20), c(1, 2, 19)]

head(carSubset)
levels(carSubset$Type)
       Price Mileage Type
721 16456.97   20200 hatchback
214 19981.13   24323 sedan
299 21757.05    1853 sedan
459 14077.97   17445 sedan
727 15639.04    8507 coupe
162 20627.66   20770 sedan
  1. 'convertible'
  2. 'coupe'
  3. 'hatchback'
  4. 'sedan'
  5. 'wagon'
In [75]:
simpleMod <- dummyVars(~Mileage + Type,
                       data = carSubset,
                       ## Remove the variable name from the column name
                       levelsOnly = TRUE)
simpleMod
Dummy Variable Object

Formula: ~Mileage + Type
2 variables, 1 factors
Factor variable names will be removed
A less than full rank encoding is used
In [76]:
head(predict(simpleMod, carSubset))
    Mileage convertible coupe hatchback sedan wagon
721   20200           0     0         1     0     0
214   24323           0     0         0     1     0
299    1853           0     0         0     1     0
459   17445           0     0         0     1     0
727    8507           0     1         0     0     0
162   20770           0     0         0     1     0
In [77]:
withInteraction <- dummyVars(~Mileage + Type + Mileage:Type,
                             data = carSubset,
                             levelsOnly = TRUE)
withInteraction
predict(withInteraction, head(carSubset))
Dummy Variable Object

Formula: ~Mileage + Type + Mileage:Type
2 variables, 1 factors
Factor variable names will be removed
A less than full rank encoding is used
    Mileage convertible coupe hatchback sedan wagon Mileage:convertible Mileage:coupe Mileage:hatchback Mileage:sedan Mileage:wagon
721   20200           0     0         1     0     0                   0             0             20200             0             0
214   24323           0     0         0     1     0                   0             0                 0         24323             0
299    1853           0     0         0     1     0                   0             0                 0          1853             0
459   17445           0     0         0     1     0                   0             0                 0         17445             0
727    8507           0     1         0     0     0                   0          8507                 0             0             0
162   20770           0     0         0     1     0                   0             0                 0         20770             0

Adding appropriate interaction terms can better explain how predictors relate to the response.

$y = \beta_0 + \beta_1\,(x_1-\mu_1) + \beta_2\,(x_2-\mu_2)$ may fail to capture how $x_1$ and $x_2$ interact in their effect on $y$.

$y = \beta_0 + \beta_1\,(x_1-\mu_1) + \beta_2\,(x_2-\mu_2) + \beta_{12}\,(x_1-\mu_1)(x_2-\mu_2)$ may be a more appropriate model for data that has an interaction effect.
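In R's formula interface, `x1:x2` denotes the interaction term alone, while `x1 * x2` expands to both main effects plus the interaction; a minimal sketch on simulated data:

```r
# Fitting the model with an interaction term via lm (a sketch; simulated data):
set.seed(4)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 + 0.25 * x2 + 0.75 * x1 * x2 + rnorm(n, sd = 0.5)
cf <- coef(lm(y ~ x1 * x2))   # x1 * x2 expands to x1 + x2 + x1:x2
round(cf, 2)                  # estimates near the true (1, 0.5, 0.25, 0.75)
```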

Part 3. Finding a good prediction model

Quantitative measures of performance

Measuring the predictive performance of the candidate prediction models is an important part of the modeling process. Some models will fit a sample well but fail to give good predictions for future observations.

When the population is known

Assume we know all $N$ elements of a population having predictors $\mathbf{x}_i$ and responses $y_i$. A common measure of the predictive performance of a function $f(\mathbf{x})$ is the mean squared error (MSE), given by $$MSE(f) = \frac1N \sum_{i=1}^N\left(y_i - f(\mathbf{x}_i)\right)^2.$$ The better our prediction model $f(\mathbf{x}_i)$ approximates $y_i$ for the $\mathbf{x}_i$ in the population we are predicting, the lower the MSE for $f$.

Note that $f(\mathbf{x})$ does not have to approximate $y$ well for every $\mathbf{x}$, only over the part of the predictor space $\mathbb{R}^p$ where the population members $\{\mathbf{x}_i\}$ lie.

When the population is unknown

Most likely the population is unknown so the previous MSE formula cannot be used. If you have an observed sample of size $n$ from the system you are modeling, then the sample MSE of prediction function $f$ is $$MSE(f) = \frac1n \sum_{i=1}^n\left(y_i - f(\mathbf{x}_i)\right)^2.$$
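The sample MSE is a one-liner in R; a small helper as a sketch:

```r
# Sample MSE of predictions yhat against observed responses y:
mse <- function(y, yhat) mean((y - yhat)^2)
mse(c(1, 2, 3), c(1.5, 2, 2))   # (0.25 + 0 + 1)/3 = 0.4167
```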

In general we don't care about the training MSE: there is a multitude of models that can yield an MSE close to zero yet be inappropriate for predicting future observations. (An MSE of zero won't be possible if there are duplicate observations with different response values.)

However, if we have a set of observations $(\mathbf{x}_i, y_i), i = 1, \dots, n$, that are independent from a different set of observations $(\mathbf{x}_i, y_i), i = n+1, \dots, n+m$, then we can use the first $n$ observations to train the prediction model $f(\mathbf{x})$, as you would train a dog to do tricks. The trick here is finding a model that generates predictive values for new observations $\mathbf{x}$ such that $f(\mathbf{x})$ is close to the true value $y$.

We can assess how the trained function $f(\mathbf{x})$ will perform on the population (or part of the population) we want to predict by using the additional $m$ observations not used in training. Since we know the $y$ for those $m$ values, we can calculate our model's test MSE using $$MSE_{test}(f) = \frac1m \sum_{i=n+1}^{n+m}\left(y_i - f(\mathbf{x}_i)\right)^2.$$

This test MSE is a good estimate of how well our function $f$ will predict values for $\mathbf{x}$ in the population, but only if these $m$ observations are representative of the observations we will encounter when implementing our prediction model.

$$\Big[\; n+m \; \text{observations} \; \Big] \quad \longrightarrow \quad \Big[\; n \; \text{training observations} \; \Big] + \Big[\; m \; \text{testing observations} \; \Big]$$
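A minimal sketch of this split on simulated data, where the truth is linear with noise variance $\sigma^2 = 1$:

```r
# Train/test split and test MSE (a sketch; simulated data):
set.seed(5)
N <- 150
d <- data.frame(x = runif(N, 0, 10))
d$y <- 2 * d$x + rnorm(N)                  # truth: y = 2x + noise, sigma = 1
idx   <- sample(N, 100)                    # n = 100 training observations
train <- d[idx, ]
test  <- d[-idx, ]                         # m = 50 testing observations
fit <- lm(y ~ x, data = train)
test_mse <- mean((test$y - predict(fit, test))^2)
test_mse                                   # should be near sigma^2 = 1
```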

Cross-set testing

What if the $n$ observations are also representative of the future data we'll predict? Our performance assessment for the model-generated function $f$ would be more accurate if we include the $n$ training observations along with the $m$ observations of the test set.

But if we calculate the training MSE for $f$ equal to $$MSE_{train}(f) = \frac1n \sum_{i=1}^{n}\left(y_i - f(\mathbf{x}_i)\right)^2$$ we do not get a good assessment of how $f$ will predict new data because $f$ learns to predict those $n$ observations well (so $MSE_{train}(f)$ will be low no matter how well $f$ predicts new observations).

What if we flip the roles of $n$ and $m$ above, and use the $m$ observations to train $f$ and then the $n$ observations for testing? $$\Big[\; n+m \; \text{observations} \; \Big] \quad \longrightarrow \quad \Big[\; n \; \text{testing observations} \; \Big] + \Big[\; m \; \text{training observations} \; \Big]$$

When we train our prediction model on the $m$ observations, we will get a different $f^{(m)}$ than the one from training on the $n$ observations $f^{(n)}$. That doesn't matter because we are assessing the prediction model that generates the functions $f(\mathbf{x})$ rather than the specific functions $f$.

Let's do this: using the $n$ observations our prediction model gives function $f^{(n)}(\mathbf{x})$ that we assess with the $m$ observations, and then do this again using the $m$ observations in our prediction model to get function $f^{(m)}(\mathbf{x})$ assessed with the $n$ observations. Our overall assessed MSE is $$MSE(f) = \frac1{n+m} \left( \sum_{i=1}^{n}\left(y_i - f^{(m)}(\mathbf{x}_i)\right)^2 + \sum_{i=n+1}^{n+m}\left(y_i - f^{(n)}(\mathbf{x}_i)\right)^2 \right).$$

This MSE is likely the best assessment so far given that all $n+m$ observations are representative of the data we will have to predict.

$k$-fold testing

The implementation above is flawed in that if $m$ is small relative to $n$, then the function $f^{(m)}$ is not likely to give good results due to its small training set size. Instead we can use equal-sized folds of the data to assess our prediction model $f$. Let's say we divide the data into 3 folds (and this example can be easily expanded to $k$ folds). Then we perform the following:

  • Train the model on folds 2 + 3, and calculate $$MSE_1(f) = \frac1{n_1}\sum_{fold\ 1} \left(y_i - f^{(2+3)}(\mathbf{x}_i)\right)^2$$
  • Train the model on folds 1 + 3, and calculate $$MSE_2(f) = \frac1{n_2}\sum_{fold\ 2} \left(y_i - f^{(1+3)}(\mathbf{x}_i)\right)^2$$
  • Train the model on folds 1 + 2, and calculate $$MSE_3(f) = \frac1{n_3}\sum_{fold\ 3} \left(y_i - f^{(1+2)}(\mathbf{x}_i)\right)^2$$

where $n_j$ is the number of observations in fold $j$.

Then the overall assessment of model $f$ is the average of these three $MSE_k$ values, or more properly: $$MSE(f) = \sum_{j=1}^k \frac{n_j}n\, MSE_j(f),$$ with the total number of observations $n$ being equal to ${\sum_{j=1}^k n_j}$.

Using $k=5$ or 10 folds should give an accurate assessment of the prediction model under these two important assumptions:

  1. all observations are equally representative of the data $\mathbf{x}$ to be predicted
    • does the sample span the same part of the predictor space in $\mathbb{R}^p$ as the population to be predicted?
  2. the observations in each of the $k$ folds are independent of the observations in all other folds

$k$-fold assessment works well because:

  1. the individual error sums $MSE_j(f)$ are on data not used in formulating the function $f^{(-j)}$
  2. all observations are used to assess performance
  3. in general, a higher $k$ is better because more data is used to form the prediction functions, meaning the MSE terms should be lower.

After a prediction model is selected based on its lowest $k$-fold MSE, the final model is trained with all $n$ observations, and that model's $f(\mathbf{x})$ is used for predicting values on new $\mathbf{x}$.
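The fold recipe above takes only a few lines of base R; a minimal sketch of $k$-fold cross-validation for a linear model on simulated data:

```r
# k-fold cross-validation MSE for a linear model (a sketch; simulated data):
set.seed(6)
n <- 120; k <- 5
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- 2 * dat$x + rnorm(n)                 # truth: linear with sigma^2 = 1
fold <- sample(rep(1:k, length.out = n))      # random fold labels, sizes n_j
mse_j <- sapply(1:k, function(j) {
    fit <- lm(y ~ x, data = dat[fold != j, ])            # train on the other folds
    mean((dat$y[fold == j] - predict(fit, dat[fold == j, , drop = FALSE]))^2)
})
cv_mse <- weighted.mean(mse_j, w = tabulate(fold))       # weight by fold sizes n_j
cv_mse                                        # should be near sigma^2 = 1
```

(In practice caret's train function automates exactly this resampling loop.)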

LOOCV

Carrying this idea to the extreme is Leave One Out Cross-Validation, using $k=n$. Some linear methods have closed-form calculations for this metric (and a related approximation called generalized cross-validation), so the model doesn't have to be refit $n$ times, each time leaving out the observation to be predicted.
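For an `lm` fit, the exact LOOCV error comes from a single fit via the leverage (hat) values, using the PRESS residuals $e_i/(1-h_{ii})$; a minimal sketch:

```r
# Exact LOOCV MSE for lm from one fit, via PRESS residuals e_i / (1 - h_ii):
set.seed(7)
dat <- data.frame(x = runif(50, 0, 10))
dat$y <- 2 * dat$x + rnorm(50)
fit <- lm(y ~ x, data = dat)
loocv_mse <- mean((residuals(fit) / (1 - hatvalues(fit)))^2)
loocv_mse    # should be near sigma^2 = 1
```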

Simplicity vs. accuracy: The Bias-Variance Trade-off

As you increase the model complexity, you risk "overfitting" your observations by thinking random deviations in the training data are indicative of the true relationship between $\mathbf{x}$ and $y$, when in reality the model is just fitting noise.

But if your model is too simple, then the true relationship between $\mathbf{x}$ and $y$ cannot be accounted for with the function $f$. We say that model bias exists when the expected value of the fit $E(f)$ does not equal the mean truth value $\mu = E(y)$. The goal is to predict $\mu(\mathbf{x})$ for all $\mathbf{x}$.

Putting this together in the expected MSE formula: $$E[MSE] = \sigma^2 + (\text{Model Bias})^2 + (\text{Model Variance}).$$

The first term, $\sigma^2$, is due to the noise in the system, the variation that cannot be modeled. Your model can never do better than $\sigma^2$, as this portion is the irreducible error that cannot be predicted. This part equals the expected variance of the response about its mean value (the truth, $\mu(\mathbf{x})$): $$ \sigma^2 = E(y - \mu(\mathbf{x}))^2.$$

The second term is the squared model bias, where the bias is the amount by which our fit, averaged over all possible training data sets $\mathcal{T}$, differs from the truth $\mu(\mathbf{x})$: $$ \text{Model Bias} = E_\mathcal{T}\big(f(\mathbf{x})\big) - \mu(\mathbf{x}).$$

The third term, model variance, is the expected contribution to the MSE from the variation of the fits (over all possible training data sets) about their mean: $$ \text{Model Variance} = E_\mathcal{T}\Big(f(\mathbf{x}) - E_\mathcal{T}\big(f(\mathbf{x})\big)\Big)^2.$$

Calculation of the MSE for $f$ takes place over the appropriate range of $\mathbf{x}$, as the above quantities are defined at a single point $\mathbf{x}$.
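For completeness, the decomposition follows by adding and subtracting $\mu(\mathbf{x})$ and $E_\mathcal{T}(f)$ inside the expected squared error:

$$\begin{aligned} E\big(y - f(\mathbf{x})\big)^2 &= E\Big(\big(y - \mu(\mathbf{x})\big) + \big(\mu(\mathbf{x}) - E_\mathcal{T}(f)\big) + \big(E_\mathcal{T}(f) - f(\mathbf{x})\big)\Big)^2 \\ &= \underbrace{E\big(y - \mu(\mathbf{x})\big)^2}_{\sigma^2} + \underbrace{\big(E_\mathcal{T}(f) - \mu(\mathbf{x})\big)^2}_{(\text{Model Bias})^2} + \underbrace{E_\mathcal{T}\big(f - E_\mathcal{T}(f)\big)^2}_{\text{Model Variance}}, \end{aligned}$$

where the cross terms vanish because the noise $y - \mu(\mathbf{x})$ is independent of the training set and $f - E_\mathcal{T}(f)$ has mean zero over training sets.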

Example 1: linear fit vs. 7th-order polynomial

Imagine fitting a collection of 10 points located at 1 through 10 with a linear fit or 7th-order polynomial when the truth is $\mu = x$.

Five different training data sets (from the infinite possible training sets) are:

In [78]:
set.seed(0)
train = data.frame(run = rep(1:5,each=10), x = rep(1:10,5))
train$y = train$x + rnorm(nrow(train))
xs = seq(0.5,10.5,,1000)
test = data.frame(run = NA, x = xs, y = NA)
for (j in 2:7){
    train[,j+2] = train$x ^ j
    test[,j+2] = test$x ^ j
}
fit1 = fit7 = matrix(NA, max(train$run), length(xs))
for (r in 1:max(train$run)){
    lm1 = lm(y ~ x, data=train[train$run == r,-1])
    fit1[r,] = predict(lm1, test)
    lm7 = lm(y ~ ., data=train[train$run == r,-1])
    fit7[r,] = predict(lm7, test)    
}
In [79]:
plotsize(8,4)
par(mfrow=c(1,3))
plot(train$x, train$y, pch=train$run, col=train$run, xlab='x', ylab='y', las=1, 
     main='5 training sets')
abline(0,1,col='orange',lwd=2)
plot(train$x, train$y, pch=train$run, col=train$run, xlab='x', ylab='y', las=1, 
     main='linear fits have\nlow bias and low variance')
for (r in 1:max(train$run)){
    lines(xs,fit1[r,],col=r)
}
abline(0,1,col='orange',lwd=2)
plot(train$x, train$y, pch=train$run, col=train$run, xlab='x', ylab='y', las=1,
     main='7th-order polynomial fits have\nlow bias but high variance')
for (r in 1:max(train$run)){
    lines(xs,fit7[r,],col=r)
}
abline(0,1,col='orange',lwd=2)
In [80]:
T = 1000
xs = seq(0.5,10.5,,1000)
results1 = results7 = matrix(0, T, length(xs))
for (trial in 1:T){
    set.seed(trial)
    train = data.frame(x = 1:10)
    train$y = train$x + rnorm(nrow(train))
    test = data.frame(x = xs, y = NA)
    for (j in 2:7){
        train[,j+1] = train$x ^ j
        test[,j+1] = test$x ^ j
    }
    lm1 = lm(y ~ x, data=train)
    results1[trial,] = predict(lm1, test)
    lm7 = lm(y ~ ., data=train)
    results7[trial,] = predict(lm7, test)
}
In [81]:
plotsize(7,4)
par(mfrow=c(1,2))
plot(c(1,10),c(-1,12), type='n', xlab='x', ylab='y', las=1, 
     main='range of 80% of\nlinear fits')
abline(0,1,col='orange',lwd=2)
lines(xs, colMeans(results1), lty=2)
lines(xs, apply(results1,2,quantile,prob=.9), col='red', lty=2)
lines(xs, apply(results1,2,quantile,prob=.1), col='red', lty=2)
legend('topleft',c('mean','P10,P90'),lty=2,col=c(1,2),cex=1,y.intersp=2.5)
plot(c(1,10),c(-1,12), type='n', xlab='x', ylab='y', las=1, 
     main='range of 80% of\n7th-order polynomial fits')
abline(0,1,col='orange',lwd=2)
lines(xs, colMeans(results7), lty=2)
lines(xs, apply(results7,2,quantile,prob=.9), col='red', lty=2)
lines(xs, apply(results7,2,quantile,prob=.1), col='red', lty=2)
legend('topleft',c('mean','P10,P90'),lty=2,col=c(1,2),cex=1,y.intersp=2.5)

The fact that the mean fit aligns with the truth means both methods are unbiased, i.e. $E(f) = \mu$.

Example 2: MSE decomposed into squared bias and variance

Let the truth function be $\mu(\mathbf{x}) = 5 (x + x^2 + x^3)$ on the range $(-1, 1)$. We'll fit polynomials from order 1 to order 5 and confirm that a third-order polynomial is best in terms of MSE.

In [82]:
n = 20
J = 250
P = 5
set.seed(0)
x <- runif(n*J, -1, 1)
xte <- seq(-1,1,,1000)
y <- 5*x + 5*x^2 + 5*x^3 + rnorm(n*J)
yte <- 5*xte + 5*xte^2 + 5*xte^3
TrainMSE= rep(0,P)
SqBias = rep(NA,P)
MSE = rep(0,P)
for (p in 1:P){
    train = test = data.frame(y=y)
    for (i in 1:p){
        train[,i+1] = x^i
        test[,i+1] = xte^i
    }
    overall.lm = lm(y~., data=train)
    yhat = predict(overall.lm, test)
    SqBias[p] = sum((yhat - yte)^2)/length(yte)
    for (j in 1:J){
        obs = 1:n + n*(j-1)
        indiv.lm = lm(y~., data=train[obs,])
        yhat = predict(indiv.lm, test)
        MSE[p] = MSE[p] + (1/J) * sum((yhat - yte)^2)/length(yte) 
        TrainMSE[p] = TrainMSE[p] + (1/J) * sum(indiv.lm$residuals ^ 2)/n
    }
}
Variance = MSE - SqBias
In [83]:
plotsize(7,4)
par(mfrow=c(1,2))
plot(1:P, SqBias, ylim=c(0,max(MSE)), las=1, 
     main='MSE components vs.\npolynomial complexity',
    ylab='squared error', xlab='model complexity')
points(1:P, Variance, col=2, pch=3)
points(1:P, MSE, col=4, pch=2)
points(1:P, TrainMSE, col=6, pch=4)
legend(2.5,16.5,c('sq.bias','variance','total MSE','train MSE'),
       col=c(1,2,4,6),pch=c(1,3,2,4),y.intersp=2.5, cex=.75)
plot(1:P, SqBias, ylim=c(0.7,max(MSE)), log='y', las=1, 
     main='MSE vs. complexity\non log scale',
    ylab='squared error', xlab='model complexity')
points(1:P, Variance, col=2, pch=3)
points(1:P, MSE, col=4, pch=2)
points(1:P, TrainMSE, col=6, pch=4)

Note that training MSE decreases with model complexity, even though the order-4 and order-5 polynomials do not fit the test data well.

Trade-off between Model Interpretability vs. Flexibility

As models grow in complexity through increased flexibility, they become harder to interpret. We will see why LASSO regression, which is roughly as flexible as least squares, is popular due to its ability to select from a wide range of predictors. Random forests make complex predictions from a large number of predictors but offer some techniques to assess the importance of the individual predictors; they would sit just above Bagging in the chart.

We'll learn most of these methods in this course.

Interpretability vs. flexibility, from James et al., Introduction to Statistical Learning (2013)

Model tuning and overfitting

Some prediction models have parameters that determine how they generate fit functions $f$. I think of these parameters as knobs that must be set before the model can run. For example, in $k$-nearest neighbors regression, where the fit function averages the $k$ nearest observations in the training data, the value of $k$ must be set before prediction values can be generated.
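To see what the knob does, here is a minimal hand-rolled version of one-dimensional $k$-nearest-neighbors regression (toy data; the function name and $k=3$ are my choices for illustration):

```r
# Predict at x0 by averaging the responses of the k closest training points.
knn.predict <- function(x0, x, y, k) {
    nearest <- order(abs(x - x0))[1:k]   # indices of the k closest x values
    mean(y[nearest])                     # average their responses
}
set.seed(1)
x <- 1:10
y <- x + rnorm(10)
knn.predict(5, x, y, k = 3)   # averages y at x = 4, 5, 6
```

Changing $k$ changes the fit: $k=1$ interpolates the training data (overfitting), while $k=n$ returns the same global mean everywhere (underfitting).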

In [84]:
plotsize(6,4)
plot(datasets::cars, main='Stopping distance vs. vehicle speed')
In [85]:
knnFit <- train(dist ~ ., data=datasets::cars, method = "knn", metric = "RMSE",
                tuneGrid = data.frame(k=1:9), trControl = trainControl(method = "cv"))
knnFit
k-Nearest Neighbors 

50 samples
 1 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 44, 44, 46, 45, 45, 46, ... 
Resampling results across tuning parameters:

  k  RMSE      Rsquared 
  1  15.34478  0.6883600
  2  14.81315  0.7085822
  3  14.66545  0.7104909
  4  14.85405  0.6775448
  5  14.76945  0.6788361
  6  15.10588  0.6643477
  7  16.26172  0.6614847
  8  16.50983  0.6490712
  9  16.60977  0.6508582

RMSE was used to select the optimal model using  the smallest value.
The final value used for the model was k = 3.
In [86]:
xs = seq(0,30,,1000)
knn.model <- knn.reg(datasets::cars$speed, test=data.frame(speed=xs), 
                     y=datasets::cars$dist, k=5)
str(knn.model)
List of 7
 $ call     : language knn.reg(train = datasets::cars$speed, test = data.frame(speed = xs), y = datasets::cars$dist,      k = 5)
 $ k        : num 5
 $ n        : int 1000
 $ pred     : num [1:1000] 10.8 10.8 10.8 10.8 10.8 10.8 10.8 10.8 10.8 10.8 ...
 $ residuals: NULL
 $ PRESS    : NULL
 $ R2Pred   : NULL
 - attr(*, "class")= chr "knnReg"
In [87]:
plot(datasets::cars, main='Predicted stopping distance using\nk-nearest neighbors')
for (k in c(25,10,5)){
    knn.model <- knn.reg(datasets::cars$speed, test=data.frame(speed=xs), 
                         y=datasets::cars$dist, k=k)
    if (k == 5){
        lines(xs, knn.model$pred, col=4, lwd=4)
        } else {
        lines(xs, knn.model$pred, col=ceiling(sqrt(k/0.1)), lwd=4)        
    }
}
legend('topleft',c('k=5','k=10','k=25'),col=c(4,2,8), lwd=3, y.intersp=2)

Many methods in the prediction modeling toolbox have parameters that must be set, like the number of neighbors $k$ here. Proper tuning determines the parameter values that give good predictions. Without tuning these knobs, you risk overfitting the training data (for example, $k=1$ above) or underfitting it (the gray $k=25$ fit).

Part 4: Example prediction task

Predicting fuel economy for vehicles using a data set with many predictors (from Chapter 2 of Applied Predictive Modeling)

Kuhn and Johnson code header

################################################################################
### R code from Applied Predictive Modeling (2013) by Kuhn and Johnson.
### Copyright 2013 Kuhn and Johnson
### Web Page: http://www.appliedpredictivemodeling.com
### Contact: Max Kuhn (mxkuhn@gmail.com)
###
### Chapter 2: A Short Tour of the Predictive Modeling Process
###
### Required packages: AppliedPredictiveModeling, earth, caret, lattice
###
### Data used: The FuelEconomy data in the AppliedPredictiveModeling package
###
### Notes: 
### 1) This code is provided without warranty.
###
### 2) This code should help the user reproduce the results in the
### text. There will be differences between this code and what is is
### the computing section. For example, the computing sections show
### how the source functions work (e.g. randomForest() or plsr()),
### which were not directly used when creating the book. Also, there may be 
### syntax differences that occur over time as packages evolve. These files 
### will reflect those changes.
###
### 3) In some cases, the calculations in the book were run in 
### parallel. The sub-processes may reset the random number seed.
### Your results may slightly vary.
###
################################################################################
In [88]:
# (C) Kuhn and Johnson, 2013
################################################################################
### Section 2.1 Case Study: Predicting Fuel Economy

data(FuelEconomy)

## Format data for plotting against engine displacement

## Sort by engine displacement
cars2010 <- cars2010[order(cars2010$EngDispl),]
cars2011 <- cars2011[order(cars2011$EngDispl),]

## Combine data into one data frame
cars2010a <- cars2010
cars2010a$Year <- "2010 Model Year"
cars2011a <- cars2011
cars2011a$Year <- "2011 Model Year"

plotData <- rbind(cars2010a, cars2011a)

library(lattice)
plotsize(6,4)
xyplot(FE ~ EngDispl|Year, plotData,
       xlab = "Engine Displacement",
       ylab = "Fuel Efficiency (MPG)",
       between = list(x = 1.2))
In [89]:
# (C) Kuhn and Johnson, 2013
## Fit a single linear model and conduct 10-fold CV to estimate the error
set.seed(1)
lm1Fit <- train(FE ~ EngDispl, 
                data = cars2010,
                method = "lm", 
                trControl = trainControl(method= "cv"))
lm1Fit
Linear Regression 

1107 samples
   1 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 997, 996, 995, 996, 997, 996, ... 
Resampling results:

  RMSE      Rsquared
  4.604285  0.628494

Tuning parameter 'intercept' was held constant at a value of TRUE
In [90]:
# (C) Kuhn and Johnson, 2013
## Fit a quadratic model too

## Create squared terms
cars2010$ED2 <- cars2010$EngDispl^2
cars2011$ED2 <- cars2011$EngDispl^2

set.seed(1)
lm2Fit <- train(FE ~ EngDispl + ED2, 
                data = cars2010,
                method = "lm", 
                trControl = trainControl(method= "cv"))
lm2Fit
Linear Regression 

1107 samples
   2 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 997, 996, 995, 996, 997, 996, ... 
Resampling results:

  RMSE      Rsquared 
  4.228432  0.6843226

Tuning parameter 'intercept' was held constant at a value of TRUE
In [91]:
# (C) Kuhn and Johnson, 2013
## Finally a MARS model (via the earth package)

set.seed(1)
marsFit <- train(FE ~ EngDispl, 
                 data = cars2010,
                 method = "earth",
                 tuneLength = 15,
                 trControl = trainControl(method= "cv"))
marsFit
Multivariate Adaptive Regression Spline 

1107 samples
   1 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 997, 996, 995, 996, 997, 996, ... 
Resampling results across tuning parameters:

  nprune  RMSE      Rsquared 
  2       4.295551  0.6734579
  3       4.255755  0.6802699
  4       4.224908  0.6849626
  5       4.246868  0.6824923

Tuning parameter 'degree' was held constant at a value of 1
RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were nprune = 4 and degree = 1.
In [92]:
plotsize(5,3)
plot(marsFit)
In [93]:
head(cars2011)
     EngDispl NumCyl Transmission      FE AirAspirationMethod NumGears TransLockup TransCreeperGear          DriveDesc IntakeValvePerCyl ExhaustValvesPerCyl CarlineClassDesc VarValveTiming VarValveLift  ED2
234       1.4      4           S6 50.4000        Turbocharged        6           1                0 TwoWheelDriveFront                 2                   2      CompactCars              1            0 1.96
235       1.4      4           S6 54.0500        Turbocharged        6           1                0 TwoWheelDriveFront                 2                   2      CompactCars              1            0 1.96
236       1.4      4           M6 59.7000        Turbocharged        6           0                0 TwoWheelDriveFront                 2                   2      CompactCars              1            0 1.96
237       1.4      4           AV 52.7496  NaturallyAspirated        1           0                0 TwoWheelDriveFront                 2                   2      CompactCars              1            0 1.96
24        1.5      4           M6 52.2000  NaturallyAspirated        6           0                0 TwoWheelDriveFront                 2                   2         2Seaters              1            1 2.25
25        1.5      4        Other 55.6446  NaturallyAspirated        1           1                0 TwoWheelDriveFront                 2                   2         2Seaters              1            1 2.25
In [94]:
cars.dummy <- dummyVars(~ Transmission + AirAspirationMethod + DriveDesc + CarlineClassDesc, 
                        data=rbind(cars2010,cars2011))
# combine the 2 data sets in case there are different levels in them
cars.dummy
Dummy Variable Object

Formula: ~Transmission + AirAspirationMethod + DriveDesc + CarlineClassDesc
4 variables, 4 factors
Variables and levels will be separated by '.'
A less than full rank encoding is used
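As an aside (not part of Kuhn and Johnson's code), base R's model.matrix() builds a comparable dummy encoding; by default it is full rank, dropping one reference level per factor, unlike the less-than-full-rank encoding above. A toy factor for illustration:

```r
# Hypothetical toy factor: model.matrix() makes the first level ("A4") the
# reference, so it gets no indicator column of its own.
tr <- factor(c("A4", "M6", "AV", "M6"))
mm <- model.matrix(~ tr)
colnames(mm)   # "(Intercept)" "trAV" "trM6"
```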
In [95]:
cars2010D <- predict(cars.dummy, cars2010)
cars2011D <- predict(cars.dummy, cars2011)
head(cars2011D)
(first 6 rows of cars2011D: a 0/1 indicator matrix with one column per factor level, named Factor.Level, e.g. Transmission.AV, AirAspirationMethod.Turbocharged, DriveDesc.TwoWheelDriveFront, CarlineClassDesc.CompactCars; the wide output was truncated in the original display)
In [96]:
quantcol <- sapply(cars2010[1,],is.numeric)
cars2010Q <- cbind(cars2010[,quantcol],cars2010D)
cars2011Q <- cbind(cars2011[,quantcol],cars2011D)
head(cars2010Q)
dim(cars2010Q)
(first 6 rows of cars2010Q: the numeric columns EngDispl, NumCyl, FE, NumGears, TransLockup, TransCreeperGear, IntakeValvePerCyl, ExhaustValvesPerCyl, VarValveTiming, VarValveLift, followed by the 0/1 dummy columns; the wide output was truncated in the original display)
  1. 1107
  2. 52
In [97]:
colremove <- which(colnames(cars2010Q) %in% c("ED2", "FE"))
cars.pp <- preProcess(cars2010Q[,-colremove], method=c("BoxCox", "center", "scale"))
X10 <- predict(cars.pp, cars2010Q[,-colremove])
X11 <- predict(cars.pp, cars2011Q[,-colremove])
X10$FE <- cars2010$FE
X11$FE <- cars2011$FE
cars.pp
Created from 1107 samples and 50 variables

Pre-processing:
  - Box-Cox transformation (3)
  - centered (50)
  - ignored (0)
  - scaled (50)

Lambda estimates for Box-Cox transformation:
0.1, -0.3, 2
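The key point of preProcess is that the transformation parameters are estimated on the training data and then applied unchanged to the test data. A base-R sketch of the center/scale step only (toy data; the Box-Cox step is omitted):

```r
# Standardize training data, then reuse the SAME means and sds on test data.
set.seed(1)
train.x <- matrix(rnorm(20, mean = 5), ncol = 2)   # 10 x 2 training matrix
test.x  <- matrix(rnorm(10, mean = 5), ncol = 2)   #  5 x 2 test matrix
mu  <- colMeans(train.x)
sdv <- apply(train.x, 2, sd)
train.z <- sweep(sweep(train.x, 2, mu), 2, sdv, "/")  # mean 0, sd 1 per column
test.z  <- sweep(sweep(test.x, 2, mu), 2, sdv, "/")   # training mu, sdv reused
```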
In [98]:
head(X11)
(first 6 rows of X11: all predictors centered and scaled, e.g. row 234 has EngDispl = -2.27 and NumCyl = -1.19, with the untransformed response FE in the last column; the wide output was truncated in the original display)

It turns out the MARS model in caret doesn't need explicit dummy variables: the formula interface expands factors into indicators automatically.

In [99]:
head(cars2010,3)
     EngDispl NumCyl Transmission   FE AirAspirationMethod NumGears TransLockup TransCreeperGear         DriveDesc IntakeValvePerCyl ExhaustValvesPerCyl CarlineClassDesc VarValveTiming VarValveLift  ED2
1130      1.0      3        Other 57.8  NaturallyAspirated        5           1                0 TwoWheelDriveRear                 2                   2         2Seaters              1            0 1.00
1131      1.0      3        Other 57.8  NaturallyAspirated        5           1                0 TwoWheelDriveRear                 2                   2         2Seaters              1            0 1.00
1279      1.3      2           M6 30.2  NaturallyAspirated        6           0                0 TwoWheelDriveRear                 0                   0   SubcompactCars              0            0 1.69
In [100]:
## Trying to make a better MARS model (via the earth package)
set.seed(1)
startTime <- proc.time()[3]
# Trying nprune in [3,9] and degree in [1,3], the minimum is nprune=8, degree=2.
# To keep runtimes short, I pare the grid to [5,9] and [1,2].
tuneMatrix <- data.frame(nprune = rep(5:9,2), degree = rep(1:2, each=5))
marsFit2 <- train(FE ~ ., 
                 data = cars2010,
                 method = "earth",
                 tuneGrid = tuneMatrix,
                 trControl = trainControl(method= "cv"))
cat(sprintf("Runtime = %.1f sec", proc.time()[3]-startTime))
# takes about 20 sec to run
marsFit2
Runtime = 21.4 sec
Multivariate Adaptive Regression Spline 

1107 samples
  14 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 997, 996, 995, 996, 997, 996, ... 
Resampling results across tuning parameters:

  degree  nprune  RMSE      Rsquared 
  1       5       3.906037  0.7318517
  1       6       3.868169  0.7373657
  1       7       3.823439  0.7420339
  1       8       3.676657  0.7620382
  1       9       3.681752  0.7614852
  2       5       3.840216  0.7413350
  2       6       3.720717  0.7579110
  2       7       3.626918  0.7699591
  2       8       3.532365  0.7815650
  2       9       3.575115  0.7762788

RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were nprune = 8 and degree = 2.
In [101]:
## a knn model based on the scaled data
set.seed(1)
startTime <- proc.time()[3]
knnFit <- train(FE ~ ., 
                 data = X10,
                 method = "knn",
                 tuneLength = 6,
                 trControl = trainControl(method= "cv"))
cat(sprintf("Runtime = %.1f sec", proc.time()[3]-startTime))
# takes about 1 sec to run
knnFit
Runtime = 1.1 sec
k-Nearest Neighbors 

1107 samples
  50 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 997, 996, 995, 996, 997, 996, ... 
Resampling results across tuning parameters:

  k   RMSE      Rsquared 
   5  4.293863  0.6786391
   7  4.267483  0.6810855
   9  4.226845  0.6876317
  11  4.241529  0.6877355
  13  4.297421  0.6828040
  15  4.299264  0.6840084

RMSE was used to select the optimal model using  the smallest value.
The final value used for the model was k = 9.
In [102]:
## Fit a linear model on everything
set.seed(1)
lm3Fit <- train(FE ~ ., 
                data = X10,
                method = "lm", 
                trControl = trainControl(method= "cv"))
lm3Fit
Warning message in predict.lm(modelFit, newdata):
“prediction from a rank-deficient fit may be misleading”
(this warning is repeated 10 times, once per cross-validation fold)
Linear Regression 

1107 samples
  50 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 997, 996, 995, 996, 997, 996, ... 
Resampling results:

  RMSE      Rsquared 
  3.388487  0.7990773

Tuning parameter 'intercept' was held constant at a value of TRUE
In [103]:
# Removing correlated variables
X10Corr <- cor(X10)

plotsize(8,8)
corrplot(X10Corr, order = "hclust", tl.cex = .35)
In [104]:
FEcol <- which(colnames(X10) %in% "FE")
highCorr <- findCorrelation(X10Corr[-FEcol,-FEcol], .9)
colRemove <- colnames(X10)[highCorr]
colRemove
  1. 'EngDispl'
  2. 'ExhaustValvesPerCyl'
  3. 'AirAspirationMethod.NaturallyAspirated'
In [105]:
X10R <- X10[,-which(colnames(X10) %in% colRemove)]
X11R <- X11[,-which(colnames(X10) %in% colRemove)]
In [106]:
colnames(X10R)
  1. 'NumCyl'
  2. 'NumGears'
  3. 'TransLockup'
  4. 'TransCreeperGear'
  5. 'IntakeValvePerCyl'
  6. 'VarValveTiming'
  7. 'VarValveLift'
  8. 'Transmission.Other'
  9. 'Transmission.A4'
  10. 'Transmission.A5'
  11. 'Transmission.A6'
  12. 'Transmission.A7'
  13. 'Transmission.AM6'
  14. 'Transmission.AM7'
  15. 'Transmission.AV'
  16. 'Transmission.AVS6'
  17. 'Transmission.M5'
  18. 'Transmission.M6'
  19. 'Transmission.S4'
  20. 'Transmission.S5'
  21. 'Transmission.S6'
  22. 'Transmission.S7'
  23. 'Transmission.S8'
  24. 'AirAspirationMethod.Supercharged'
  25. 'AirAspirationMethod.Turbocharged'
  26. 'DriveDesc.AllWheelDrive'
  27. 'DriveDesc.FourWheelDrive'
  28. 'DriveDesc.ParttimeFourWheelDrive'
  29. 'DriveDesc.TwoWheelDriveFront'
  30. 'DriveDesc.TwoWheelDriveRear'
  31. 'CarlineClassDesc.Other'
  32. 'CarlineClassDesc.2Seaters'
  33. 'CarlineClassDesc.CompactCars'
  34. 'CarlineClassDesc.LargeCars'
  35. 'CarlineClassDesc.MidsizeCars'
  36. 'CarlineClassDesc.MinicompactCars'
  37. 'CarlineClassDesc.SmallPickupTrucks2WD'
  38. 'CarlineClassDesc.SmallPickupTrucks4WD'
  39. 'CarlineClassDesc.SmallStationWagons'
  40. 'CarlineClassDesc.SpecialPurposeVehicleminivan2WD'
  41. 'CarlineClassDesc.SpecialPurposeVehicleSUV2WD'
  42. 'CarlineClassDesc.SpecialPurposeVehicleSUV4WD'
  43. 'CarlineClassDesc.StandardPickupTrucks2WD'
  44. 'CarlineClassDesc.StandardPickupTrucks4WD'
  45. 'CarlineClassDesc.SubcompactCars'
  46. 'CarlineClassDesc.VansCargoTypes'
  47. 'CarlineClassDesc.VansPassengerType'
  48. 'FE'
In [107]:
## Fit a linear model on everything
set.seed(1)
lm4Fit <- train(FE ~ ., 
                data = X10R,
                method = "lm", 
                trControl = trainControl(method= "cv"))
lm4Fit
Warning message in predict.lm(modelFit, newdata):
“prediction from a rank-deficient fit may be misleading”
(this warning is repeated 10 times, once per cross-validation fold)
Linear Regression 

1107 samples
  47 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 997, 996, 995, 996, 997, 996, ... 
Resampling results:

  RMSE      Rsquared 
  3.656657  0.7684956

Tuning parameter 'intercept' was held constant at a value of TRUE
In [108]:
# (C) Kuhn and Johnson, 2013
## Predict the test set data
cars2011$lm1  <- predict(lm1Fit,  cars2011)
cars2011$lm2  <- predict(lm2Fit,  cars2011)
cars2011$mars1 <- predict(marsFit, cars2011)
cars2011$mars2 <- predict(marsFit2, cars2011)
cars2011$knn <- predict(knnFit, X11)

## Get test set performance values via caret's postResample function
postResample(pred = cars2011$lm1,  obs = cars2011$FE)
postResample(pred = cars2011$lm2,  obs = cars2011$FE)
postResample(pred = cars2011$mars1, obs = cars2011$FE)
postResample(pred = cars2011$mars2, obs = cars2011$FE)
postResample(pred = cars2011$knn, obs = cars2011$FE)
RMSE
5.16253091526297
Rsquared
0.701864184158507
RMSE
4.7162853433046
Rsquared
0.748607360003538
RMSE
4.68555005425347
Rsquared
0.7499953328713
RMSE
4.06591779298863
Rsquared
0.809649769233236
RMSE
5.43283748163835
Rsquared
0.671063847256909
In [109]:
cars2011$lm3 <- predict(lm3Fit, X11)
postResample(pred = cars2011$lm3, obs = cars2011$FE)
cars2011$lm4 <- predict(lm4Fit, X11R)
postResample(pred = cars2011$lm4, obs = cars2011$FE)
Warning message in predict.lm(modelFit, newdata):
“prediction from a rank-deficient fit may be misleading”
RMSE
3.65250501328522
Rsquared
0.849466457743109
Warning message in predict.lm(modelFit, newdata):
“prediction from a rank-deficient fit may be misleading”
RMSE
4.0399658770505
Rsquared
0.822500142964803
In [110]:
plotsize(7,4)
plot(cars2011$EngDispl, cars2011$FE, cex=.5)
lines(cars2011$EngDispl, cars2011$lm2, col=2)
points(cars2011$EngDispl, cars2011$lm3, col=4, pch=2, cex=.6)
In [111]:
plotsize(7,6)
plot(cars2011$EngDispl, cars2011$FE, cex=.5)
lines(cars2011$EngDispl, cars2011$lm2, col=2)
for (i in 1:nrow(cars2011)){
    lines(rep(cars2011$EngDispl[i],2),c(cars2011$FE[i],cars2011$lm2[i]), col=2, lwd=2)
    lines(rep(cars2011$EngDispl[i]+runif(1,-.1,.1),2),
          c(cars2011$FE[i],cars2011$lm3[i]), col=4, lwd=2)
}
In [112]:
plotsize(8,8)
par(mfrow=c(2,1))
orderx = order(cars2011$EngDispl)
plot(1:nrow(cars2011), cars2011$FE[orderx], cex=.5, xlab='Observation', 
     ylab='Fuel Eff. residual',
    main='Quadratic fit')
for (i in 1:nrow(cars2011)){
    lines(rep(i,2),c(cars2011$FE[orderx[i]],cars2011$lm2[orderx[i]]), col=2, lwd=2)
}
plot(1:nrow(cars2011), cars2011$FE[orderx], cex=.5, xlab='Observation', 
     ylab='Fuel Eff. residual',
    main='Multivariate linear fit')
for (i in 1:nrow(cars2011)){
    lines(rep(i,2),c(cars2011$FE[orderx[i]],cars2011$lm3[orderx[i]]), col=4, lwd=2)
}

The multivariate linear fit appears to predict the test data better than the quadratic fit in engine displacement, despite the warnings that the data matrix $\mathbf{X}$ is rank deficient.